Conformal Validity Guarantees Exist for Any Data Distribution (and How to Find Them)

Drew Prinster; Samuel Stanton; Anqi Liu; Suchi Saria

TL;DR

This work tackles reliable uncertainty quantification for AI systems that actively influence their own data distributions. It shows that conformal prediction (CP) can achieve valid coverage for any joint data distribution with a density $f$, not just exchangeable ones, and provides a general procedure for deriving weighted CP algorithms for a given distribution. The authors instantiate this procedure for multistep feedback covariate shift (MFCS), derive practical polynomial-time estimators (the $d$-step weights), and demonstrate through protein-design and active-learning experiments that coverage is preserved where baselines fail. The results advance robust uncertainty quantification for autonomous data-collecting agents and offer actionable methods for building reliable AI systems in dynamic environments.

Abstract

As artificial intelligence (AI) and machine learning (ML) gain widespread adoption, practitioners are increasingly seeking means to quantify and control the risk these systems incur. This challenge is especially salient when such systems have autonomy to collect their own data, such as in black-box optimization and active learning, where their actions induce sequential feedback-loop shifts in the data distribution. Conformal prediction is a promising approach to uncertainty and risk quantification, but prior variants' validity guarantees have assumed some form of "quasi-exchangeability" on the data distribution, thereby excluding many types of sequential shifts. In this paper we prove that conformal prediction can theoretically be extended to any joint data distribution, not just exchangeable or quasi-exchangeable ones. Although the most general case is exceedingly impractical to compute, for concrete practical applications we outline a procedure for deriving specific conformal algorithms for any data distribution, and we use this procedure to derive tractable algorithms for a series of AI/ML-agent-induced covariate shifts. We evaluate the proposed algorithms empirically on synthetic black-box optimization and active learning tasks.
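
For orientation, the exchangeable baseline that the abstract generalizes can be sketched as standard split conformal prediction. This is a minimal illustration and not the paper's method: the predictor `predict`, the toy data generator `draw_point`, and the absolute-residual score are all assumptions chosen for the example.

```python
import math
import random


def conformal_quantile(scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score.

    If that rank exceeds n, the quantile is infinite (the interval is unbounded).
    """
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    return sorted(scores)[rank - 1] if rank <= n else math.inf


# Hypothetical pre-fitted model and toy data; both are assumptions for illustration.
rng = random.Random(0)


def predict(x):
    return 2.0 * x


def draw_point():
    x = rng.uniform(-1.0, 1.0)
    return x, 2.0 * x + rng.gauss(0.0, 0.5)


# Calibration: absolute residuals on held-out points drawn i.i.d. (hence exchangeable).
calibration = [draw_point() for _ in range(200)]
scores = [abs(y - predict(x)) for x, y in calibration]
q_hat = conformal_quantile(scores, alpha=0.1)

# The prediction interval at x is [predict(x) - q_hat, predict(x) + q_hat].
test_points = [draw_point() for _ in range(2000)]
hits = sum(abs(y - predict(x)) <= q_hat for x, y in test_points)
empirical_coverage = hits / len(test_points)
```

Because the test points here come from the same distribution as the calibration points, the empirical coverage should sit near the 90% target. The feedback-loop shifts described in the abstract break exactly this assumption.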

Paper Structure (36 sections, 2 theorems, 51 equations, 11 figures, 4 tables)

Introduction
Background
Standard Conformal Prediction
Weighted Conformal Prediction for Covariate Shift
Multistep Feedback Covariate Shift
Related Work
Theory and Method Contributions
The Role of (Weighted) Exchangeability and Related Assumptions in Conformal Prediction
A General View of Conformal Prediction
How to Find Weighted CP Validity Guarantees and Algorithms for MFCS (or Any Data Distribution)
Experimental Results
Multistep Protein Design Experiments
Active Learning-Induced MFCS
Discussion
Main Result Proof and Details
...and 21 more sections

Key Result

Theorem 4.1

Assume that $Z_i=(X_i, Y_i)\in\mathbb{R}^d\times \mathbb{R}$, $i=1, \ldots, n+1$, have the joint PDF $f$. For any measurable score function $\mathcal{S}$ and any $\alpha\in(0,1)$, define the generalized conformal prediction set (based on $n$ calibration samples) at a point $x\in \mathbb{R}^d$ by [...], where $V_i^{(x,y)}$, $i\in \{1, \ldots, n+1\}$, are as in Eq. \ref{eq:def_scores_ordinary_full} and $\mathbb{P}_{n+1}\{z_i \mid \ldots$ [...]
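
The theorem statement is truncated here, so the generalized set itself cannot be reproduced. As a rough illustration of the weighted conformal idea named in the section list ("Weighted Conformal Prediction for Covariate Shift"), the sketch below computes a weighted score quantile in the familiar covariate-shift style. It is an assumption-laden stand-in: the weights are taken as given inputs, and the function name `weighted_conformal_quantile` is hypothetical. It does not show how the paper's MFCS or $d$-step weights are computed.

```python
import math


def weighted_conformal_quantile(scores, weights, test_weight, alpha):
    """Level-(1 - alpha) quantile of the weighted score distribution.

    Each calibration score carries probability w_i / (sum(w) + test_weight);
    the remaining mass, test_weight / (sum(w) + test_weight), sits at +infinity
    to account for the unknown test score.
    """
    total = sum(weights) + test_weight
    cumulative = 0.0
    for score, weight in sorted(zip(scores, weights)):
        cumulative += weight / total
        if cumulative >= 1 - alpha:
            return score
    # Not enough mass on the finite scores: the prediction set is unbounded.
    return math.inf


# With equal weights this reduces to the standard (unweighted) conformal quantile.
uniform_q = weighted_conformal_quantile([4, 1, 3, 2], [1, 1, 1, 1], 1, alpha=0.5)

# A heavily weighted test point can force an infinite quantile.
shifted_q = weighted_conformal_quantile([4, 1, 3, 2], [1, 1, 1, 1], 10, alpha=0.5)
```

The infinite return value corresponds to the uninformative (infinitely wide) intervals that the figure captions mention for aggressive proposal distributions.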

Figures (11)

Figure 1: Results for protein design ($n=32$, $\lambda = 8$) with linear ridge regression, comparing the proposed MFCS Full CP method ($d=2$) to standard Full CP and One-Step FCS Full CP. Values are computed over 1,000 random seeds; error bars for mean coverage and mean predicted fitness are standard errors, while error bars for median interval width are interquartile ranges. MFCS Full CP maintains coverage where the baselines do not. Its intervals are wider than those of the one-step FCS baseline where the latter loses coverage ($t=2, 3, 4$), but similar where the baseline maintains coverage ($t=1, 5$). (Further experimental details in Appendix \ref{subsec:black_box_opt_exp_details}, Table \ref{tab:fig1}.)
Figure 2: Results for fluorescent protein design with a multi-layer perceptron (MLP) regressor (initial training and calibration data sizes: $n_{\text{train}}=n_{\text{cal}}=32$; $\lambda = 5$), comparing the proposed MFCS Split CP methods to Standard Split CP and the One-Step FCS Split CP baselines. Values are computed over 500 random seeds. Error bars for mean coverage and mean predicted fitness are standard errors; for median interval width they are the interquartile ranges. Error bars extending beyond the top of the figure indicate infinite upper quartiles. MFCS Split CP maintains coverage where the baselines do not, but its intervals are occasionally very wide, suggesting the proposal distribution is too aggressive for dependably informative uncertainty estimation. (More experimental details in Appendix \ref{subsec:black_box_opt_exp_details}, Table \ref{tab:fig2}.)
Figure 3: Active learning experiments of the proposed MFCS Split CP method for $d=3$ (red) compared to baselines of unweighted Split CP (orange), One-Step Split CP (green), and ACI (gray). Y-axes represent mean coverage, median interval width, and mean squared error on a holdout test set to track the accuracy of the base predictor. X-axes correspond to the number of active learning query steps, with each query based on posterior variance of a GP regressor. All values are computed over 350 distinct random seeds (full experimental details in Appendix \ref{subsec:active_learning_exp_details}). The proposed MFCS CP method maintains target coverage over a long time horizon even where baselines do not.
Figure 4: A causal directed acyclic graph (DAG) model that implies $Z_1, \ldots, Z_{n+t-1} \perp\!\!\!\perp Y_{n+t} \mid X_{n+t}$ (as does MFCS). Moreover, the blue edges represent relationships that are further assumed to be equivalent, implying $Y_{n+t} \mid X_{n+t} \stackrel{d}{=} Y \mid X$ for all $t\in \{1, \ldots, T\}$.
Figure 5: Unbounded (top row) versus bounded (bottom row) active learning experiments of the proposed multistep split CP method for $d=3$ (red circles) compared to baselines of unweighted split CP (orange squares), one-step split CP (green triangles), and ACI (gray squares) on the airfoil dataset. The Y-axes represent mean coverage, median interval width, and mean squared error on a holdout test set; the X-axes correspond to the number of active learning query steps, with each query based on posterior variance of a GP regressor. All values are computed over 350 distinct random seeds. Hyperparameters for the experiments are given in Appendix \ref{subsec:active_learning_exp_details}, Table \ref{tab:fig3_hyperparams}.
...and 6 more figures
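
The captions above report mean coverage with standard errors and median interval width with interquartile ranges, and note that some upper quartiles are infinite. The sketch below shows one way such per-seed summaries could be computed. The helper names and the nearest-rank quartile convention are assumptions for illustration; the captions do not state which quantile convention the paper uses.

```python
import math
import statistics


def nearest_rank_quantile(values, q):
    """Nearest-rank quantile; no interpolation, so infinite widths stay well defined."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def summarize_seeds(covered, widths):
    """Summarize per-seed results at one time step.

    covered: list of 0/1 flags (did the interval contain the true label?)
    widths:  list of interval widths, possibly including math.inf
    """
    n = len(covered)
    mean_coverage = sum(covered) / n
    # Standard error of the mean: sample standard deviation divided by sqrt(n).
    sample_var = sum((c - mean_coverage) ** 2 for c in covered) / (n - 1)
    return {
        "mean_coverage": mean_coverage,
        "coverage_se": math.sqrt(sample_var / n),
        "median_width": statistics.median(widths),
        "width_q25": nearest_rank_quantile(widths, 0.25),
        "width_q75": nearest_rank_quantile(widths, 0.75),
    }


# Toy inputs (made up for the example): five seeds, two with unbounded intervals.
summary = summarize_seeds([1, 1, 1, 1, 0], [1.0, 2.0, 3.0, math.inf, math.inf])
```

In this toy input the median width is finite while the upper quartile is infinite, which is the situation the Figure 2 caption describes as error bars extending beyond the top of the plot.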

Theorems & Definitions (5)

Theorem 4.1
Remark 4.2
Remark 4.3
Corollary 2.1
Remark 4.1
