Improving Predictor Reliability with Selective Recalibration

Thomas P. Zollo, Zhun Deng, Jake C. Snell, Toniann Pitassi, Richard Zemel

TL;DR

This work proposes selective recalibration, where a selection model learns to reject some user-chosen proportion of the data in order to allow the recalibrator to focus on regions of the input space that can be well-captured by such a model.

Abstract

A reliable deep learning system should be able to accurately express its confidence with respect to its predictions, a quality known as calibration. One of the most effective ways to produce reliable confidence estimates with a pre-trained model is by applying a post-hoc recalibration method. Popular recalibration methods like temperature scaling are typically fit on a small amount of data and work in the model's output space, as opposed to the more expressive feature embedding space, and thus usually have only one or a handful of parameters. However, the target distribution to which they are applied is often complex and difficult to fit well with such a function. To this end, we propose selective recalibration, where a selection model learns to reject some user-chosen proportion of the data in order to allow the recalibrator to focus on regions of the input space that can be well-captured by such a model. We provide theoretical analysis to motivate our algorithm, and test our method through comprehensive experiments on difficult medical imaging and zero-shot classification tasks. Our results show that selective recalibration consistently leads to significantly lower calibration error than a wide range of selection and recalibration baselines.
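
As a rough illustration of the two ingredients named in the abstract, the sketch below fits a single-parameter temperature-scaling recalibrator and then keeps a user-chosen proportion $\beta$ of the data. It is not the authors' algorithm: the paper trains a selection model together with the recalibrator, whereas this sketch substitutes plain confidence-based rejection for the selector, and the temperature is found by grid search rather than gradient descent. The function names (`fit_temperature`, `select_by_confidence`), the grid range, and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def softmax(logits, T=1.0):
    """Row-wise softmax of logits divided by temperature T."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fit_temperature(logits, labels, grid=None):
    """Hypothetical helper: choose T by grid search on negative log-likelihood."""
    if grid is None:
        grid = np.linspace(0.25, 5.0, 96)  # assumed search range, step 0.05
    idx = np.arange(len(labels))
    nll = [-np.log(softmax(logits, T)[idx, labels] + 1e-12).mean() for T in grid]
    return float(grid[int(np.argmin(nll))])

def select_by_confidence(conf, beta):
    """Stand-in selector: accept roughly the top beta fraction by confidence."""
    threshold = np.quantile(conf, 1.0 - beta)
    return conf >= threshold

# Synthetic data: labels follow softmax(true_logits), but the "model" reports
# logits that are three times too large, so it is overconfident by construction.
rng = np.random.default_rng(0)
true_logits = rng.normal(size=(5000, 5))
probs = softmax(true_logits)
labels = np.array([rng.choice(5, p=p) for p in probs])
logits = 3.0 * true_logits

T_hat = fit_temperature(logits, labels)      # recalibrator with one parameter
conf = softmax(logits, T_hat).max(axis=1)    # recalibrated confidence
mask = select_by_confidence(conf, beta=0.8)  # accept about 80% of examples
```

On this toy data the fitted temperature should land near the factor used to inflate the logits. A learned selector, as proposed in the paper, would replace `select_by_confidence` with a model trained to pick the region where the recalibrator fits well.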

Paper Structure (56 sections, 3 theorems, 57 equations, 5 figures, 2 tables)

Key Result

Theorem 1

Under Assumption as:theta, for any $\delta\in(0,1)$ and $\hat{\theta}$ output by $\mathscr{A}$, there exist thresholds $M\in\mathbb{N}^+$ and $\tau>0$ such that if $\max\{r_1,r_2,\sigma,\|\theta^*\|\}<\tau$ and $m>M$, there exists a positive lower bound $L$, with probability at least $1-\delta$ over [...] However, there exist $T_0$ and $g_0$ satisfying $\mathbb{E}[g_0(x)]\ge \beta$, such that SR [...]

Figures (5)

  • Figure 1: Reliability Diagrams for a model that has different calibration error (deviation from the diagonal) in different subsets of the data (here shown in blue and green). The data per subset is binned based on confidence values; each marker represents a bin, and its size depicts the amount of data in the bin. The red dashed diagonal represents perfect calibration, where confidence equals expected accuracy.
  • Figure 2: Selective calibration error on ImageNet and Camelyon17 for coverage level $\beta \in \{0.75, 0.8, 0.85, 0.9\}$. Left: Various recalibration methods are trained using labeled validation data. Middle: Selection baselines including confidence-based rejection and various OOD measures. Right: Selective recalibration with different loss functions.
  • Figure 3: Plots illustrating 1) the distribution of confidence over the full data distribution and over the examples accepted for prediction (i.e., where $g(x)=1$) at coverage level $\beta=0.8$ and 2) selective accuracy for coverage levels $\beta \in [0.8, 1.0]$.
  • Figure 4: A classifier pre-trained on a mixture model is applied to a target distribution with outliers.
  • Figure 5: Selective calibration error on ImageNet and Camelyon17 for coverage level $\beta \in \{0.75, 0.8, 0.85, 0.9\}$. Left: Various recalibration methods are trained using labeled validation data. Middle: Selection baselines including confidence-based rejection and various OOD measures. Right: Selective recalibration with different loss functions.
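
The binning described in the Figure 1 caption, and the selective calibration error plotted in Figures 2 and 5, can be sketched as follows. This is a minimal sketch assuming an ECE-style estimator (equal-width confidence bins, absolute gap between mean confidence and accuracy, weighted by bin size) evaluated only on accepted examples; the paper's exact estimator may differ, and the function names and toy data are hypothetical.

```python
import numpy as np

def binned_calibration_error(conf, correct, n_bins=15):
    """Size-weighted mean |confidence - accuracy| over equal-width confidence bins."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges map each confidence to a bin index in 0..n_bins-1.
    bin_ids = np.digitize(conf, edges[1:-1])
    err = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if in_bin.any():
            gap = abs(conf[in_bin].mean() - correct[in_bin].mean())
            err += in_bin.mean() * gap  # weight by the fraction of data in the bin
    return err

def selective_calibration_error(conf, correct, accept, n_bins=15):
    """Same estimator, restricted to examples the selector accepts."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    accept = np.asarray(accept, dtype=bool)
    return binned_calibration_error(conf[accept], correct[accept], n_bins)

# Toy data with two subsets, in the spirit of Figure 1: both report confidence
# 0.9, but subset A is 90% accurate while subset B is only 50% accurate.
conf = np.full(20, 0.9)
correct = np.array([1] * 9 + [0] + [1] * 5 + [0] * 5)
accept_a = np.array([True] * 10 + [False] * 10)  # selector keeps subset A only

full_err = binned_calibration_error(conf, correct)
sel_err = selective_calibration_error(conf, correct, accept_a)
```

Here the full data has accuracy 0.7 against confidence 0.9, while the accepted subset is calibrated, which is the situation selective recalibration is meant to exploit.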

Theorems & Definitions (8)

  • Definition 1: Target Distribution
  • Theorem 1
  • Theorem 2
  • Definition 2: Formal version of Definition \ref{model}
  • Lemma 1
  • Proof 1
  • Claim 5
  • Proof 2