Calibrated Selective Classification

Adam Fisch, Tommi Jaakkola, Regina Barzilay

TL;DR

Calibrated Selective Classification tackles the challenge of reliable uncertainty by pairing a fixed predictor with a trainable selector to abstain on inputs whose uncertainties are not well calibrated. It introduces selective calibration and the S-MMCE objective, plus a practical upper bound, and uses a DRO-inspired regime with synthetic domain shifts to improve out-of-domain calibration while enforcing a coverage constraint. The framework is validated on image-corruption benchmarks (CIFAR-10-C, ImageNet-C) and a lung cancer risk task, consistently reducing selective calibration error compared to baselines and showing meaningful robustness to distribution shifts. The work demonstrates that calibrated abstention can yield more trustworthy predictions without requiring full retraining of the base model, with strong implications for high-stakes decision making and medical applications.

Abstract

Selective classification allows models to abstain from making predictions (e.g., say "I don't know") when in doubt in order to obtain better effective accuracy. While typical selective models can be effective at producing more accurate predictions on average, they may still allow for wrong predictions that have high confidence, or skip correct predictions that have low confidence. Providing calibrated uncertainty estimates alongside predictions -- probabilities that correspond to true frequencies -- can be as important as having predictions that are simply accurate on average. However, uncertainty estimates can be unreliable for certain inputs. In this paper, we develop a new approach to selective classification in which we propose a method for rejecting examples with "uncertain" uncertainties. By doing so, we aim to make predictions with well-calibrated uncertainty estimates over the distribution of accepted examples, a property we call selective calibration. We present a framework for learning selectively calibrated models, where a separate selector network is trained to improve the selective calibration error of a given base model. In particular, our work focuses on achieving robust calibration, where the model is intentionally designed to be tested on out-of-domain data. We achieve this through a training strategy inspired by distributionally robust optimization, in which we apply simulated input perturbations to the known, in-domain training data. We demonstrate the empirical effectiveness of our approach on multiple image classification and lung cancer risk assessment tasks.

Paper Structure (29 sections, 7 theorems, 46 equations, 11 figures, 1 algorithm)


Key Result

Theorem 4.5

Let $k$ be a universal kernel, and let $q \geq 1$. The S-MMCE is then 0 if and only if $(f, g)$ is almost surely selectively calibrated, i.e., $\mathbb{P}(\mathbb{E}[Y \mid V] = V) = 1$, where $V := f(X) \mid g(X) = 1$.
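
As a rough illustration of what a kernel-based selective calibration measure looks at, the sketch below computes an MMCE-style statistic restricted to accepted examples ($g(X) = 1$) for binary labels. It is a stand-in under stated assumptions, not the paper's S-MMCE estimator: the Laplacian kernel, its bandwidth, the signed residual $y - f(x)$, the normalization, and the function names are all chosen here for illustration.

```python
import math


def laplacian_kernel(u, v, bandwidth=0.2):
    """Laplacian kernel on confidences; the bandwidth is an assumed value."""
    return math.exp(-abs(u - v) / bandwidth)


def selective_kernel_calibration_stat(confidences, labels, accepted, bandwidth=0.2):
    """MMCE-style statistic over accepted examples only (illustrative sketch).

    confidences: predicted probabilities f(x) in [0, 1]
    labels:      binary outcomes y in {0, 1}
    accepted:    selector decisions g(x) in {0, 1}
    """
    kept = [(r, y) for r, y, g in zip(confidences, labels, accepted) if g]
    if not kept:
        raise ValueError("selector rejected every example")
    total = 0.0
    for r_i, y_i in kept:
        for r_j, y_j in kept:
            # Residuals of nearby confidences reinforce each other via the kernel.
            total += (y_i - r_i) * (y_j - r_j) * laplacian_kernel(r_i, r_j, bandwidth)
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(total, 0.0)) / len(kept)


# Toy data: the last two predictions are confidently wrong.
toy_confidences = [1.0, 0.0, 0.9, 0.9]
toy_labels = [1, 0, 0, 0]
stat_full = selective_kernel_calibration_stat(toy_confidences, toy_labels, [1, 1, 1, 1])
stat_selected = selective_kernel_calibration_stat(toy_confidences, toy_labels, [1, 1, 0, 0])
```

On this toy input the statistic is positive when every example is accepted and drops to zero once the two confidently wrong examples are rejected. That matches the direction of the theorem, but it is only a finite-sample illustration and proves nothing about the population quantity.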

Figures (11)

  • Figure 1: A demonstration of calibrated selective classification for lung cancer risk assessment. The input $X$ is a CT scan; the output $Y$ indicates if the patient develops cancer within 6 years of the scan. $f$ (yellow) provides a confidence estimate for the patient being at risk, while $g$ (red) acts as a gatekeeper to decide when to trust $f$'s outputs. Our goal is for the accepted predictions to be calibrated. For example, of all patients with predicted (and non-rejected) risks of $0.3$, 30% should be expected to develop cancer within 6 years of the scan.
  • Figure 2: Coverage vs. selective calibration error and Brier scores on CIFAR-10-C, ImageNet-C, and lung cancer data. For CIFAR-10-C and ImageNet-C, we report average results across all perturbation types. For lung cancer, we report results on the diverse MGH test population. Empirically, across all coverage levels and metrics, rejections based on $g$ optimized for S-MMCE perform the best (or on par with the best in some cases).
  • Figure 3: $\ell_2$ selective top-label calibration error AUC reported for each of the 15 test perturbations in CIFAR-10-C (in addition to the average). Optimizing S-MMCE leads to significant error reductions across perturbation types, relative to a model without abstentions ("Full"), as well as the standard selective classifier ("Confidence").
  • Figure 4: Empirical distribution of $f(X) \mid g(X) = 1$ for different coverage rates on MGH data (for $f(X) \leq 0.4$ for visualization). Empirically, by selecting the most confident predictions, confidence-based predictions mainly take examples that are thought not to be cancerous (i.e., where $f(x) \approx \mathbb{P}(Y = 1 \mid X = x)$ is low). The behavior of the S-MMCE-based classifier, however, is less skewed towards only selecting examples of a particular confidence value, and $f(X)\mid g(X) = 1$ more closely follows the marginal distribution of $f(X)$ without selection.
  • Figure 5: Rejection ratios by label type at $\xi = 0.90$ on MGH data. Blue denotes the ratio of each class in the full data. Proportionally more of the confidence-based rejections are cancerous (presumably as they are "harder" to classify). S-MMCE rejections are relatively less imbalanced.
  • ...and 6 more figures
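
The calibration target described in Figure 1 (of all accepted patients with predicted risk $0.3$, about 30% should develop cancer) can be checked empirically with a binned estimate. The sketch below is one way to do this, assuming equal-width bins and an $\ell_2$ aggregation; the bin count, the function name, and the toy data are illustrative assumptions rather than the paper's evaluation code.

```python
import math


def selective_binned_calibration_error(confidences, labels, accepted, n_bins=10):
    """Binned l2 calibration error over accepted examples, plus coverage.

    Equal-width binning and n_bins=10 are assumptions made for this sketch.
    """
    kept = [(r, y) for r, y, g in zip(confidences, labels, accepted) if g]
    if not kept:
        raise ValueError("selector rejected every example")
    bins = [[] for _ in range(n_bins)]
    for r, y in kept:
        index = min(int(r * n_bins), n_bins - 1)  # r == 1.0 goes in the last bin
        bins[index].append((r, y))
    squared_error = 0.0
    for members in bins:
        if not members:
            continue
        mean_confidence = sum(r for r, _ in members) / len(members)
        observed_rate = sum(y for _, y in members) / len(members)
        weight = len(members) / len(kept)
        squared_error += weight * (mean_confidence - observed_rate) ** 2
    coverage = len(kept) / len(confidences)
    return math.sqrt(squared_error), coverage


# Ten patients at risk 0.3 (three develop cancer) and ten at risk 0.9 (none do).
risks = [0.3] * 10 + [0.9] * 10
outcomes = [1, 1, 1] + [0] * 7 + [0] * 10
accept_all = [1] * 20
accept_low_risk = [1] * 10 + [0] * 10

error_full, coverage_full = selective_binned_calibration_error(risks, outcomes, accept_all)
error_sel, coverage_sel = selective_binned_calibration_error(risks, outcomes, accept_low_risk)
```

In this toy setup, rejecting the ten miscalibrated high-risk predictions halves coverage and brings the binned error on the accepted set to approximately zero.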

Theorems & Definitions (23)

  • Definition 3.1: Binary calibration error
  • Definition 3.2: Top-label calibration error
  • Definition 3.3: Brier score
  • Definition 4.1: Selective calibration
  • Definition 4.2: Selective calibration error
  • Claim 4.3: Existence of a good selector
  • Example 4.4: Toy setting
  • Theorem 4.5: Faithfulness
  • Proposition 4.6: Relationship to S-BCE
  • Proposition 4.7: S-MMCE upper bound
  • ...and 13 more
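
To show how a score such as the Brier score (Definition 3.3) might be evaluated under selection, the sketch below pairs it with a confidence-based selector of the kind used as a baseline in the figures. The assumptions are mine: binary confidence is taken as $\max(f(x), 1 - f(x))$, the selector keeps the top fraction of examples at a target coverage, and all function names are hypothetical.

```python
import math


def confidence_selector(confidences, coverage):
    """Accept the ceil(coverage * n) most confident binary predictions.

    Confidence is taken as max(r, 1 - r), an assumption for this sketch.
    """
    n = len(confidences)
    n_keep = max(1, math.ceil(coverage * n))
    scores = [max(r, 1.0 - r) for r in confidences]
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    accepted = [0] * n
    for i in order[:n_keep]:
        accepted[i] = 1
    return accepted


def selective_brier_score(confidences, labels, accepted):
    """Mean squared difference between f(x) and y over accepted examples."""
    kept = [(r, y) for r, y, g in zip(confidences, labels, accepted) if g]
    if not kept:
        raise ValueError("selector rejected every example")
    return sum((r - y) ** 2 for r, y in kept) / len(kept)


example_confidences = [0.05, 0.5, 0.95, 0.6]
example_labels = [0, 1, 1, 0]
example_accepted = confidence_selector(example_confidences, coverage=0.5)
brier_selected = selective_brier_score(example_confidences, example_labels, example_accepted)
brier_full = selective_brier_score(example_confidences, example_labels, [1, 1, 1, 1])
```

Here the selector keeps the two most extreme predictions, and the Brier score on the accepted set is lower than on the full set. As Figures 4 and 5 suggest, this kind of selection can still skew which examples are accepted, so a lower Brier score on the accepted set does not imply selective calibration.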