Logistics

Date: Monday, October 6, 2025

Location: NITMB (The National Institute for Theory and Mathematics in Biology), 875 N. Michigan Avenue, 35th Floor (Suite 3500), Chicago, Illinois

Building Entrance: 172 E Chestnut St, Suite 3500, Chicago, IL 60611

Building Access Reminder: All attendees must check in at the front desk with a valid photo ID. Your building pass will be valid only for the day of the event.

Parking and Transportation: https://www.nitmb.org/getting-here

Registration: https://docs.google.com/forms/d/e/1FAIpQLSexnDf-PS7bIBrmiMp9rbD-bZ_8KGrQMHqUF7VLyqdv53V08w/viewform

Description:

This one-day workshop, a joint initiative of the National Institute for Theory and Mathematics in Biology (NITMB) and the Institute for Data, Econometrics, Algorithms, and Learning (IDEAL), focuses on the fundamental challenges at the intersection of modern data science theory and applications. The event will explore the development of novel mathematical and algorithmic foundations for extracting insights from data characterized by high dimensionality and intricate structures. While these challenges appear across many domains, modern biology—from genomics and single-cell analysis to bioimaging—serves as a key driver for developing powerful, broadly applicable methods.

The workshop brings together mathematicians, computer scientists, biologists, statisticians and scientists from related fields to foster interdisciplinary collaboration. The day’s format includes four main talks, a lightning talk session, and a poster session, with significant time reserved for discussion and networking.

Schedule:  

Time | Event

9:45 – 10:00 AM | Introduction to the Fall Special Program

10:00 – 10:30 AM | Arun Kuchibhotla (CMU): Computationally Efficient Methods for Uncertainty Quantification

10:30 – 11:00 AM | Sidhanth Mohanty (Northwestern): Mixing from beyond worst-case initializations

11:00 – 11:15 AM | Break

11:15 – 11:45 AM | Xin He (UChicago): A Bayesian factor analysis method improves detection of differentially expressed genes from single-cell CRISPR screening

11:45 AM – 12:15 PM | Yury Makarychev (TTIC): Clustering under Dimension Reduction

12:15 – 1:30 PM | Lunch

1:30 – 2:10 PM | Lightning talks

2:10 – 2:40 PM | Coffee break

2:40 – 3:10 PM | Open problem session

3:10 – 4:30 PM | Posters + free discussion

 

Abstracts: 

Speaker: Arun Kumar Kuchibhotla (Carnegie Mellon University)

Title: Computationally Efficient Methods for Uncertainty Quantification

Abstract: In this talk, I will present a simple result related to order statistics of univariate random variables and discuss its implications for computationally efficient uncertainty quantification. I will discuss efficient construction of confidence sets for parameters under weak regularity conditions. I will also discuss the construction of prediction sets for settings with distributional shifts. 

*************

Speaker: Sidhanth Mohanty (Northwestern University) 

Title: Mixing from beyond worst-case initializations

Abstract: Much of the existing theory for bounding the mixing time of a Markov chain is catered to proving mixing times from a worst-case initialization. In this talk, I will discuss some recent progress on analyzing mixing of Markov chains from average-case initializations, based on the framework of “weak Poincaré inequalities” and the simulated annealing algorithm in the context of sampling from some distributions that arise in inference and statistical physics.

*************

Speaker: Xin He (University of Chicago)

Title: A Bayesian factor analysis method improves detection of differentially expressed genes from single-cell CRISPR screening

Abstract: One of the most commonly studied statistical problems in genomics is differential expression (DE) analysis, which detects genes whose expression is affected by certain perturbations. Recently, DE analysis has been applied to single-cell CRISPR experiments, in which CRISPR perturbations are coupled with single-cell RNA sequencing to characterize the effects of multiple perturbations. However, due to its sparsity and complex structure, single-cell CRISPR screening data is challenging to analyze. In particular, standard DE methods are often underpowered to detect genes affected by CRISPR perturbations. Our method, guided sparse factor analysis (GSFA), infers latent factors that represent coregulated genes or gene modules; by borrowing information from these factors, it infers the effects of perturbations on individual genes. We demonstrate through both simulation studies and analysis of several datasets that GSFA detects perturbation effects with much higher power than state-of-the-art methods.

*************

Speaker: Yury Makarychev (TTIC) 

Title: Clustering under Dimension Reduction

Abstract: I will discuss clustering under dimension reduction. Consider an instance of Euclidean $k$-means or $k$-medians clustering. We show that projecting onto a space of dimension $d \sim \log k$ preserves the cost of every clustering within a factor of $1 + \varepsilon$, with high probability. Crucially, the target dimension $d$ is independent of the number of points $n$ in the input. Our result applies more generally to other variants of the $k$-clustering problem. This strengthens the theorem of Cohen, Elder, Musco, Musco, and Persu, who proved that the value of $k$-means is approximately preserved when $d \sim k$. No bounds on $d$ were previously known for $k$-medians. Joint work with Konstantin Makarychev and Ilya Razenshteyn.
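The flavor of this result can be illustrated numerically. The sketch below (a hypothetical toy setup in numpy, not the construction from the talk) draws synthetic clustered data in high dimension, applies a Johnson–Lindenstrauss-style random Gaussian projection to a small target dimension, and checks that the $k$-means cost of a fixed clustering is approximately preserved:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: n points in R^D with k planted clusters (all sizes hypothetical).
n, D, k = 500, 1000, 5
centers = rng.normal(size=(k, D)) * 3.0
labels = rng.integers(0, k, size=n)
X = centers[labels] + rng.normal(size=(n, D))

def kmeans_cost(points, labels, k):
    """k-means cost of a fixed clustering: sum of squared distances to cluster means."""
    cost = 0.0
    for c in range(k):
        cluster = points[labels == c]
        cost += ((cluster - cluster.mean(axis=0)) ** 2).sum()
    return cost

# Random Gaussian projection to a small dimension d, scaled so that
# squared norms are preserved in expectation.
d = 24
P = rng.normal(size=(D, d)) / np.sqrt(d)
Y = X @ P

ratio = kmeans_cost(Y, labels, k) / kmeans_cost(X, labels, k)
print(f"cost ratio after projection: {ratio:.3f}")
```

The ratio concentrates near 1 even though $d$ is far smaller than both $D$ and $n$; the theorem makes this quantitative, showing $d \sim \log k$ suffices for every clustering simultaneously.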

 

Poster Session Abstracts

Dravyansh Sharma (TTIC) 

Title: Distribution-dependent generalization for tuning high-dimensional linear regression

Abstract: Modern regression problems often involve high-dimensional data, and careful tuning of the regularization hyperparameters is crucial to avoid overly complex models that may overfit the training data while guaranteeing desirable properties like effective variable selection. We study the recently introduced direction of tuning regularization hyperparameters in linear regression across multiple related tasks. We obtain distribution-dependent bounds on the generalization error for the validation loss when tuning the L1 and L2 coefficients, including ridge, lasso and the elastic net. In contrast, prior work develops bounds that apply uniformly to all distributions, but such bounds necessarily degrade with feature dimension, d. While these bounds are shown to be tight for worst-case distributions, our bounds improve with the “niceness” of the data distribution. Concretely, we show that under additional assumptions that instances within each task are i.i.d. draws from broad well-studied classes of distributions, including sub-Gaussians, our generalization bounds do not get worse with increasing d, and are much sharper than prior work for very large d. We also extend our results to a generalization of ridge regression, where we achieve tighter bounds that take into account an estimate of the mean of the ground truth distribution.

 

Dong Xie (UChicago)

Title: Confidence Intervals for Linear Models with Arbitrary Noise Contamination

Abstract: We study confidence interval construction for linear regression under Huber’s contamination model, where an unknown fraction of noise variables is arbitrarily corrupted. While robust point estimation in this setting is well understood, statistical inference remains challenging, especially because the contamination proportion $\epsilon$ is not identifiable from the data. We develop a new algorithm that constructs confidence intervals for individual regression coefficients without any prior knowledge of the contamination level. Our method is based on a Z-estimation framework using a smooth estimating function, and it directly quantifies the uncertainty of the estimating equation after a preprocessing step that decorrelates covariates associated with the nuisance parameters. We show that the resulting confidence interval has valid coverage uniformly over all contamination distributions and attains an optimal length of order $O(1/\sqrt{n(1-\epsilon)^2})$, matching the rate achievable when the contamination proportion is known. This result stands in sharp contrast to the adaptation cost of robust interval estimation observed in the simpler Gaussian location model.

 

 

Organizers: 

Antonio Auffinger (Northwestern University)
Chao Gao (University of Chicago)
Kostya Makarychev (Northwestern University)
Yury Makarychev (Toyota Technological Institute at Chicago)

Miki Racz (Northwestern University)
Aravindan Vijayaraghavan (Northwestern University)
