Learn then Test: Calibrating Predictive Algorithms to Achieve Risk Control

Anastasios N. Angelopoulos, Stephen Bates, Emmanuel J. Candès, Michael I. Jordan, Lihua Lei

TL;DR

The paper introduces Learn then Test (LTT), a distribution-free framework that post-processes pretrained predictors, using held-out calibration data, to deliver finite-sample guarantees on predictive risk. It reframes risk control as a multiple hypothesis testing problem: each candidate threshold $\lambda$ gets a p-value, and a procedure that controls the family-wise error rate (FWER) selects the thresholds that meet a user-specified risk level, without refitting the model. The approach covers diverse tasks, including false discovery rate (FDR) control for multi-label classification, selective classification and regression, out-of-distribution (OOD) detection combined with prediction sets, and instance segmentation with statistical guarantees. The results show that even non-monotone risks can be controlled through hypothesis-testing-based calibration, which supports safer deployment of modern neural systems in vision and medical applications.

Abstract

We introduce a framework for calibrating machine learning models so that their predictions satisfy explicit, finite-sample statistical guarantees. Our calibration algorithms work with any underlying model and (unknown) data-generating distribution and do not require model refitting. The framework addresses, among other examples, false discovery rate control in multi-label classification, intersection-over-union control in instance segmentation, and the simultaneous control of the type-1 error of outlier detection and confidence set coverage in classification or regression. Our main insight is to reframe the risk-control problem as multiple hypothesis testing, enabling techniques and mathematical arguments different from those in the previous literature. We use the framework to provide new calibration methods for several core machine learning tasks, with detailed worked examples in computer vision and tabular medical data.

Paper Structure

This paper contains 33 sections, 17 theorems, 120 equations, 12 figures, 3 algorithms.

Key Result

Theorem 2.1

Suppose $p_j$ has a distribution stochastically dominating the uniform distribution for all $j$ under $\mathcal{H}_j$. Let $\mathcal{A}$ be an FWER-controlling algorithm at level $\delta$. Then $\widehat{\Lambda} = \mathcal{A}(p_1,\dots,p_N)$ satisfies the following: $$\mathbb{P}\Big(\sup_{\lambda \in \widehat{\Lambda}} R(\mathcal{T}_{\lambda}) \le \alpha\Big) \ge 1 - \delta,$$ where the supremum over an empty set is defined as $-\infty$. Thus, selecting any $\lambda \in \widehat{\Lambda}$, $\mathcal{T}_{\lambda}$ ...
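To make the recipe in the theorem concrete, here is a minimal sketch in Python that uses Bonferroni as the FWER-controlling algorithm $\mathcal{A}$. It assumes losses bounded in $[0, 1]$ and uses the simple Hoeffding p-value $\exp(-2n(\alpha - \hat{R}_j)_+^2)$ for the null hypothesis that the risk at $\lambda_j$ exceeds $\alpha$. The function names and the toy data are hypothetical and are not taken from the authors' code.

```python
import math
from typing import List, Sequence


def hoeffding_p_value(emp_risk: float, n: int, alpha: float) -> float:
    """Super-uniform p-value for the null 'risk > alpha'.

    Assumes the n calibration losses are i.i.d. and bounded in [0, 1].
    """
    gap = max(alpha - emp_risk, 0.0)
    return math.exp(-2.0 * n * gap ** 2)


def ltt_bonferroni(
    losses_per_lambda: Sequence[Sequence[float]],
    alpha: float,
    delta: float,
) -> List[int]:
    """Return indices j of the lambda grid whose null is rejected.

    losses_per_lambda[j] holds the calibration losses observed at lambda_j.
    Bonferroni rejects H_j when p_j <= delta / N, which controls the FWER.
    """
    num_hypotheses = len(losses_per_lambda)
    selected = []
    for j, losses in enumerate(losses_per_lambda):
        n = len(losses)
        emp_risk = sum(losses) / n
        p_j = hoeffding_p_value(emp_risk, n, alpha)
        if p_j <= delta / num_hypotheses:
            selected.append(j)
    return selected


# Toy calibration data (made up): 1000 binary losses per threshold.
toy_losses = [
    [1.0] * 20 + [0.0] * 980,    # empirical risk 0.02
    [1.0] * 90 + [0.0] * 910,    # empirical risk 0.09
    [1.0] * 300 + [0.0] * 700,   # empirical risk 0.30
]
chosen = ltt_bonferroni(toy_losses, alpha=0.1, delta=0.1)
```

On this toy grid only the first threshold is certified: an empirical risk of 0.09 is below $\alpha = 0.1$, but with $n = 1000$ the evidence is too weak to reject its null at level $\delta / N$.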

Figures (12)

  • Figure 1: Object detection with simultaneous distribution-free guarantees on the expected intersection-over-union, recall, and coverage rate is possible with our methods; see Section \ref{sec:detection} for details.
  • Figure 2: Multi-label prediction set examples on MS COCO using fixed-sequence testing. Black classes are correct, blues are spurious, and reds are missed.
  • Figure 4: Numerical results of selective classification on ImageNet. The violins plot the selective error over 100 data splits at levels $\alpha=0.15$ and $\delta=0.1$. The line plot shows the empirical risk and fraction of abstentions when sweeping across values of $\lambda$.
  • Figure 5: Numerical results of selective regression on the MEPS dataset. The MSE is plotted as a violin plot over 100 random splits of the MEPS data, with parameters $\alpha=0.1$, shown as the gray dotted line, and $\delta=0.1$. The fraction of abstentions is plotted similarly. The line plot shows the tradeoff. For details, see Section \ref{sec:meps}.
  • Figure 6: Numerical performance of methods for simultaneous OOD type-1 error and coverage control on CIFAR-10 with $\alpha_1=0.05$, $\alpha_2=0.01$, and $\delta=0.1$. The violins quantify coverage, type-1 error, and the power of the OOD procedure against ImageNet images over 1000 random data splits. The gray dotted lines show $\alpha_1$ and $\alpha_2$.
  • ...and 7 more figures

Theorems & Definitions (31)

  • Definition 1: Risk-controlling prediction
  • Theorem 2.1
  • Proposition 2.1: Hoeffding-Bentkus inequality p-values
  • Definition 2: FWER-controlling algorithm
  • Proposition 2.2: Bonferroni controls FWER
  • Proposition 2.3: Fixed sequence testing controls FWER
  • Proposition 2.4
  • Proposition 2.5: SGT controls FWER (Bretz et al., 2009)
  • Proposition 2.6
  • Proposition 2.7: Uniform bounds give risk control
  • ...and 21 more
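
Two of the ingredients listed above, the Hoeffding-Bentkus p-value and fixed sequence testing, can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: it assumes i.i.d. losses bounded in $[0, 1]$ and $\alpha \in (0, 1)$, it uses the form $p^{\mathrm{HB}} = \min\{\exp(-n\, h_1(\hat{R} \wedge \alpha, \alpha)),\ e \cdot \mathbb{P}(\mathrm{Bin}(n, \alpha) \le \lceil n\hat{R} \rceil)\}$ with $h_1(a, b) = a \log(a/b) + (1-a)\log((1-a)/(1-b))$, and it shows only the single-start variant of fixed sequence testing. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def _h1(a: float, b: float) -> float:
    """h_1(a, b) = a log(a/b) + (1-a) log((1-a)/(1-b)), with 0 log 0 = 0."""
    total = 0.0
    if a > 0.0:
        total += a * math.log(a / b)
    if a < 1.0:
        total += (1.0 - a) * math.log((1.0 - a) / (1.0 - b))
    return total


def _binom_cdf(m: int, n: int, p: float) -> float:
    """P(Bin(n, p) <= m), with each term computed in log space."""
    terms = []
    for k in range(0, min(m, n) + 1):
        log_pmf = (
            math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1.0 - p)
        )
        terms.append(math.exp(log_pmf))
    return min(1.0, math.fsum(terms))


def hb_p_value(losses: Sequence[float], alpha: float) -> float:
    """Hoeffding-Bentkus p-value for the null 'risk > alpha'."""
    n = len(losses)
    total = sum(losses)  # equals n times the empirical risk
    emp_risk = total / n
    hoeffding = math.exp(-n * _h1(min(emp_risk, alpha), alpha))
    bentkus = math.e * _binom_cdf(math.ceil(total), n, alpha)
    return min(1.0, hoeffding, bentkus)


def fixed_sequence_test(p_values: Sequence[float], delta: float) -> List[int]:
    """Test hypotheses in the given, pre-specified order at level delta.

    Rejects while p_j <= delta and stops at the first failure, so later
    hypotheses are never examined once one test fails.
    """
    rejected = []
    for j, p in enumerate(p_values):
        if p > delta:
            break
        rejected.append(j)
    return rejected
```

The ordering passed to `fixed_sequence_test` must be fixed before looking at the calibration data, for example from the most conservative threshold to the least conservative one. Unlike Bonferroni, each test is run at the full level $\delta$, which is why the ordering matters.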