Logistics
Date: Wednesday, October 29, 2025
Location: Northwestern University, Mudd Library, 3rd floor, Room 3514, 2233 Tech Dr, Evanston, IL 60208.
Parking: Attendees driving to the workshop can park in the North Campus Parking Garage, 2311 N Campus Dr #2300, Evanston, IL 60208. Campus map: https://maps.northwestern.edu/
Description:
This one-day workshop is part of the Fall 2025 IDEAL Special Program on High Dimensional and Complex Data Analysis. The event will explore the development of uncertainty quantification in statistics, machine learning, and computer science. We will focus on both theoretical and computational aspects of confidence sets, prediction sets, and distribution-free methods.
The workshop brings together mathematicians, computer scientists, biologists, statisticians and scientists from related fields to foster interdisciplinary collaboration.
Time | Event
9:00 – 9:45 am | Parikshit Gopalan: Calibration through the lens of indistinguishability
9:45 – 10:15 am | Break
10:15 – 11:00 am | Nikos Ignatiadis: Empirical partially Bayes multiple testing and compound χ² decisions
11:00 – 11:30 am | Break
11:30 – 11:50 am | Licong Lin: Breaking the quadratic barrier for $Z$-estimators via the jackknife
11:50 am – 12:10 pm | Rohan Hore: Conformal changepoint localization
12:10 – 2:00 pm | Lunch
2:00 – 3:00 pm | Poster session
3:00 – 3:45 pm | Hamed Hassani: Uncertainty Quantification for Generative and Collaborative AI
3:45 – 4:20 pm | Break
4:20 – 4:40 pm | Yifan Wu: Making and Evaluating Calibrated Forecasts
4:40 – 5:00 pm | Anirban Chatterjee: Detecting Miscalibration: A Nearest Neighbor Approach
Abstracts:
Speaker: Parikshit Gopalan
Title: Calibration through the lens of indistinguishability
Abstract: Calibration is a classical notion from the forecasting literature which aims to address the question: how should predicted probabilities be interpreted? In a world where we only get to observe (discrete) outcomes, how should we evaluate a predictor that hypothesizes (continuous) probabilities over possible outcomes? The study of calibration has seen a surge of recent interest, driven by its importance in ML and applications to questions in theory. In this talk, we will explore some foundational questions about how we should define and measure calibration error, and some applications to learning and complexity theory.
Speaker: Nikos Ignatiadis
Title: Empirical partially Bayes multiple testing and compound χ² decisions
Abstract: A common task in high-throughput biology is to screen for associations across thousands of units of interest, e.g., genes or proteins. Often, the data for each unit are modeled as Gaussian measurements with unknown mean and variance and are summarized as per-unit sample averages and sample variances. The downstream goal is multiple testing for the means. In this domain, it is routine to “moderate” (that is, to shrink) the sample variances through parametric empirical Bayes methods before computing p-values for the means. Such an approach is asymmetric in that a prior is posited and estimated for the nuisance parameters (variances) but not the primary parameters (means). Our work initiates the formal study of this paradigm, which we term “empirical partially Bayes multiple testing.” In this framework, if the prior for the variances were known, one could proceed by computing p-values conditional on the sample variances—a strategy called partially Bayes inference by Sir David Cox. We show that these conditional p-values satisfy an Eddington/Tweedie-type formula and are approximated at nearly-parametric rates when the prior is estimated by nonparametric maximum likelihood. The estimated p-values can be used with the Benjamini-Hochberg procedure to guarantee asymptotic control of the false discovery rate. Even in the compound setting, wherein the variances are fixed, the approach retains asymptotic type-I error guarantees.
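As a minimal illustration of the variance-moderation step described above, the sketch below assumes a fixed conjugate (scaled inverse-chi-squared) prior for the variances rather than the nonparametric maximum likelihood prior studied in the talk; under that assumption the conditional p-values reduce to the familiar moderated t-test, and they can then be passed to the Benjamini-Hochberg procedure. The function names, prior parameters, and toy data are all hypothetical.

# Hypothetical sketch: partially Bayes p-values for testing the means when the
# variances get a *known* scaled inverse-chi-squared prior (d0 df, scale s0^2).
# The talk's method instead estimates the variance prior nonparametrically.
import numpy as np
from scipy import stats

def partially_bayes_pvalues(xbar, s2, n, d0=4.0, s0_sq=1.0):
    """Two-sided p-values for H0: mean_i = 0, conditional on the sample variances."""
    d = n - 1                                   # residual df per unit
    s2_mod = (d0 * s0_sq + d * s2) / (d0 + d)   # shrunken ("moderated") variance
    t_mod = xbar / np.sqrt(s2_mod / n)          # moderated t-statistic
    return 2 * stats.t.sf(np.abs(t_mod), df=d0 + d)

def benjamini_hochberg(pvals, alpha=0.1):
    """Indices rejected by the BH procedure at level alpha."""
    m = len(pvals)
    order = np.argsort(pvals)
    thresh = alpha * np.arange(1, m + 1) / m
    passed = np.nonzero(np.sort(pvals) <= thresh)[0]
    return order[: passed[-1] + 1] if passed.size else np.array([], dtype=int)

# Toy usage: 1000 units with 5 replicates each, 50 of them non-null.
rng = np.random.default_rng(0)
n_rep, n_units = 5, 1000
data = rng.normal(size=(n_units, n_rep))
data[:50] += 2.0
pvals = partially_bayes_pvalues(data.mean(axis=1), data.var(axis=1, ddof=1), n_rep)
print(len(benjamini_hochberg(pvals, alpha=0.1)), "rejections")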
Speaker: Licong Lin
Title: Breaking the quadratic barrier for $Z$-estimators via the jackknife
Abstract: Resampling methods are especially well-suited to inference with estimators that provide only “black-box” access. The jackknife is a form of resampling, widely used for bias correction and variance estimation, that is well-understood under classical scaling where the sample size $n$ grows for a fixed problem. We study its behavior in application to estimating functionals using high-dimensional $Z$-estimators, allowing both the sample size $n$ and problem dimension $d$ to diverge. We begin by showing that the plug-in estimator based on the $Z$-estimate suffers from a quadratic breakdown: while it is $\sqrt{n}$-consistent and asymptotically normal whenever $n\gtrsim d^2$, it fails for a broad class of problems whenever $n \lesssim d^2$. We then show that under suitable regularity conditions, applying a jackknife correction yields an estimate that is $\sqrt{n}$-consistent and asymptotically normal whenever $n\gtrsim d^{3/2}$. This provides strong motivation for the use of the jackknife in high-dimensional problems where the dimension is moderate relative to the sample size. We illustrate the consequences of our general theory for various specific $Z$-estimators, including non-linear functionals in linear models; generalized linear models; and the inverse propensity score weighting (IPW) estimate for the average treatment effect, among others.
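For readers unfamiliar with the correction itself, here is a quick sketch of the generic jackknife bias correction applied to a black-box plug-in estimator. The functional chosen here (the squared norm of OLS coefficients) and all names are illustrative, not the specific $Z$-estimators analyzed in the talk.

# Sketch of Quenouille's jackknife bias correction for a black-box estimator.
import numpy as np

def plug_in(X, y):
    """Plug-in estimate of ||beta||^2 based on the OLS estimate."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    return float(beta @ beta)

def jackknife_correct(X, y, estimator):
    """n * theta_hat - (n - 1) * mean of the leave-one-out recomputations."""
    n = X.shape[0]
    theta_full = estimator(X, y)
    loo = np.array([estimator(np.delete(X, i, axis=0), np.delete(y, i))
                    for i in range(n)])
    return n * theta_full - (n - 1) * loo.mean()

# Toy usage: the plug-in overestimates ||beta||^2 = 1 when d/n is non-negligible.
rng = np.random.default_rng(1)
n, d = 500, 20
X = rng.normal(size=(n, d))
beta = np.ones(d) / np.sqrt(d)
y = X @ beta + rng.normal(size=n)
print("plug-in estimate:            ", plug_in(X, y))
print("jackknife-corrected estimate:", jackknife_correct(X, y, plug_in))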
Speaker: Rohan Hore
Title: Conformal changepoint localization
Abstract: We study the problem of offline changepoint localization, where the goal is to identify the index at which the data-generating distribution changes. Existing methods often rely on restrictive parametric assumptions or asymptotic approximations, limiting their practical applicability. To address this, we propose a distribution-free framework, CONformal CHangepoint localization (CONCH), which leverages conformal p-values to efficiently construct valid confidence sets for the changepoint. Under mild assumptions of exchangeability within each segment and independence across segments, CONCH guarantees finite-sample coverage. We further derive principled score functions that yield informative confidence sets. With appropriate score functions, we prove that the normalized length of the confidence set indeed shrinks to zero. We further establish a universality result showing that any distribution-free changepoint localization method can be viewed as an instance of CONCH. Experiments on synthetic and real data confirm that CONCH delivers precise and reliable confidence sets even in challenging settings.
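For reference, below is the standard smoothed conformal p-value that distribution-free methods of this kind build on; it is only the generic exchangeability-based building block, not the specific scores or conditioning that define CONCH, and the function name is made up.

# Generic smoothed conformal p-value: under exchangeability of the reference
# scores and the test score, the returned p-value is (exactly) uniform, so
# P(p <= alpha) <= alpha for every alpha.
import numpy as np

def conformal_pvalue(reference_scores, test_score, rng=None):
    rng = rng or np.random.default_rng()
    s = np.asarray(reference_scores)
    n_greater = np.sum(s > test_score)
    n_ties = np.sum(s == test_score)
    return (n_greater + rng.uniform() * (n_ties + 1)) / (s.size + 1)

In a test-inversion approach of this flavor, one would keep every candidate changepoint whose p-value exceeds the target level; the talk's contribution lies in the specific score functions and conditioning that make the resulting confidence set valid and informative.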
Speaker: Hamed Hassani
Title: Uncertainty Quantification for Generative and Collaborative AI
Abstract: As AI models become increasingly powerful, their predictions are now being integrated into real-world decision-making pipelines — from medicine to autonomous systems. Ensuring that these systems remain reliable requires a principled understanding of uncertainty, especially as modern AI moves toward generative models and collaborative settings where humans interact with AI. Together, these trends raise fundamental questions about how uncertainty should be quantified when decisions are made jointly by humans and machines, and how such quantification can extend to the complex output spaces of generative AI. This talk will focus on conformal prediction as a tool for uncertainty quantification and will aim to provide answers to these questions.
In the first part of the talk, I will discuss Uncertainty Quantification for Generative Models, where the key question is: How can we provide conformal guarantees when the model exposes only a query oracle, such as an LLM that outputs text samples? We develop Conformal Prediction with Query Oracle (CPQ), a framework that connects conformal prediction with the classical missing-mass problem, yielding new estimators and algorithms that provably balance coverage, informativeness, and query cost in black-box generative settings.
In the second part, I will present our recent work on Human-AI Collaborative Uncertainty Quantification, which asks: How should an AI system refine a human expert’s uncertainty estimate without undermining their expertise? We introduce a framework to construct prediction sets that formalizes collaboration through two principles: counterfactual harm (the AI must not harm the human’s correct judgments) and complementarity (the AI should recover correct outcomes the human misses). We will develop distribution-free algorithms with finite-sample guarantees, validated across vision, regression, and medical decision-making tasks.
Together, these works illustrate emerging principles for uncertainty quantification in the era of human-AI collaboration and generative modeling, where reliability depends not just on statistical coverage, but on the dynamics between humans, models, and the unseen parts of their predictive worlds.
Speaker: Yifan Wu
Title: Making and Evaluating Calibrated Forecasts
Abstract: Calibrated predictions can be reliably interpreted as probabilities. An important step towards achieving better calibration is to design an appropriate calibration measure to meaningfully assess the miscalibration level of a predictor. A recent line of work initiated by Haghtalab et al. [2024] studies the design of truthful calibration measures: a truthful measure is minimized when a predictor outputs the true probabilities, whereas a non-truthful measure incentivizes the predictor to lie so as to appear more calibrated. In this talk, I will present a perfectly truthful calibration measure in the batch setting, introduced in Hartline et al. [2025].
The perfectly truthful calibration measure can be generalized to multi-class prediction tasks. We study common methods of extending calibration measures from binary to multi-class prediction and identify ones that do or do not preserve truthfulness. In addition to truthfulness, we mathematically prove and empirically verify that our calibration measure exhibits superior robustness: it robustly preserves the ordering between dominant and dominated predictors, regardless of the choice of hyperparameters (bin sizes). This result addresses the non-robustness issue of binned ECE, which has been observed repeatedly in prior work.
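For context, here is a small sketch of the binned ECE whose bin-size sensitivity the abstract refers to; the toy predictor and all names are made up, and the truthful, robust measure from the talk is not implemented here.

# Binned expected calibration error (L1 version) with equal-width bins.
# The reported value can shift noticeably with the number of bins.
import numpy as np

def binned_ece(pred, outcome, n_bins=10):
    bins = np.clip((pred * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(pred[mask].mean() - outcome[mask].mean())
    return ece

# Toy usage: a slightly noisy but nearly calibrated predictor.
rng = np.random.default_rng(2)
true_p = rng.uniform(size=5000)
pred = np.clip(true_p + rng.normal(scale=0.05, size=true_p.size), 0.0, 1.0)
outcome = (rng.uniform(size=true_p.size) < true_p).astype(float)
for n_bins in (5, 10, 30, 100):
    print(n_bins, "bins ->", round(binned_ece(pred, outcome, n_bins), 4))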
Speaker: Anirban Chatterjee
Title: Detecting Miscalibration: A Nearest Neighbor Approach
Abstract: In this talk, we introduce a nearest-neighbor based approach for testing the calibration of predictive models. For classification models, we focus on the well-known $L_2$ expected calibration measure and propose a resampling-based, consistent estimator of this measure using a nearest neighbor framework. The proposed estimator is consistent for the true calibration measure under minimal assumptions. Furthermore, we show that under the null hypothesis of a calibrated model, the properly normalized statistic is asymptotically standard Gaussian. This result enables a practical and readily applicable test for model calibration. In addition, we present a method to derandomize the resampling procedure, which can potentially improve the power of the test. Finally, we discuss extensions of this measure for assessing calibration beyond classification models.
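To make the target quantity concrete, here is a naive nearest-neighbor plug-in for the $L_2$ calibration error of a binary classifier; it omits the resampling, debiasing, and normalization that the talk's consistent estimator and test rely on, and every name in it is illustrative.

# Naive k-NN plug-in for E[(E[Y | f(X)] - f(X))^2], the L2 calibration error.
import numpy as np
from scipy.spatial import cKDTree

def knn_l2_calibration(pred, outcome, k=50):
    tree = cKDTree(pred.reshape(-1, 1))
    _, idx = tree.query(pred.reshape(-1, 1), k=k)
    cond_mean = outcome[idx].mean(axis=1)   # crude estimate of E[Y | f(X) = pred_i]
    return float(np.mean((cond_mean - pred) ** 2))

# Toy usage: miscalibrated predictor with E[Y | pred = p] = p^2, so the true
# L2 calibration error is E[(p^2 - p)^2] = 1/30 for uniform p.
rng = np.random.default_rng(3)
p = rng.uniform(size=2000)
y = (rng.uniform(size=2000) < p ** 2).astype(float)
print("naive kNN estimate:", round(knn_l2_calibration(p, y), 4), "(true value = 1/30)")

The plain plug-in is biased by the neighborhood smoothing, which is one reason a debiasing or resampling step, as in the talk, is needed before a properly normalized test statistic can be formed.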
Organizers:
Chao Gao
Miklos Z. Racz
He Jia
Liren Shan
Sidhanth Mohanty
