On the Limitations of Temperature Scaling for Distributions with Overlaps

Muthu Chidambaram, Rong Ge

TL;DR

This paper analyzes the limitations of temperature scaling for calibrating models trained on distributions with overlapping class supports. It studies Mixup and its generalization, d-Mixup, as training-time calibration techniques that impose neighborhood constraints on model predictions. Theoretical results show that ERM interpolators with mild regularity are poorly calibrated under overlap, even with oracle temperature scaling, while (d-)Mixup interpolators achieve robust calibration on broad subclasses of these distributions. Experiments on synthetic high-dimensional Gaussian data and on image benchmarks with label noise corroborate the theory, with Mixup variants improving calibration metrics (NLL, ECE, ACE). Overall, the work shows that training-time calibration can be necessary in certain regimes and points to neighborhood-constrained approaches as a promising direction for reliable uncertainty estimates in deep classifiers.
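
For readers unfamiliar with Mixup, the following is a minimal sketch of the standard two-point variant, in which each training example is replaced by a convex combination of itself and a randomly chosen partner. The function name `mixup_batch`, the `alpha` parameter, and the choice of a Beta(alpha, alpha) mixing distribution are assumptions for illustration; this summary does not specify the mixing distribution used for d-Mixup, so the sketch should not be read as the paper's exact procedure.

```python
import numpy as np

def mixup_batch(x, y_onehot, alpha=1.0, rng=None):
    """Illustrative two-point Mixup on one batch (hypothetical helper).

    x        : array of shape (n, ...) holding the inputs
    y_onehot : array of shape (n, k) holding one-hot labels
    alpha    : Beta(alpha, alpha) parameter (an assumed, common choice)
    """
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)       # single mixing weight in [0, 1]
    perm = rng.permutation(len(x))     # random partner for every example
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mix, y_mix, lam
```

Training on `(x_mix, y_mix)` with cross-entropy against the soft labels ties the prediction at a point to the labels of nearby mixed points, which is the kind of neighborhood constraint the summary refers to.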

Abstract

Despite the impressive generalization capabilities of deep neural networks, they have been repeatedly shown to be overconfident when they are wrong. Fixing this issue is known as model calibration, and has consequently received much attention in the form of modified training schemes and post-training calibration procedures such as temperature scaling. While temperature scaling is frequently used because of its simplicity, it is often outperformed by modified training schemes. In this work, we identify a specific bottleneck for the performance of temperature scaling. We show that for empirical risk minimizers for a general set of distributions in which the supports of classes have overlaps, the performance of temperature scaling degrades with the amount of overlap between classes, and asymptotically becomes no better than random when there are a large number of classes. On the other hand, we prove that optimizing a modified form of the empirical risk induced by the Mixup data augmentation technique can in fact lead to reasonably good calibration performance, showing that training-time calibration may be necessary in some situations. We also verify that our theoretical results reflect practice by showing that Mixup significantly outperforms empirical risk minimization (with respect to multiple calibration metrics) on image classification benchmarks with class overlaps introduced in the form of label noise.
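
As a point of reference, temperature scaling divides a trained model's logits by a single scalar T, chosen to minimize negative log-likelihood on held-out data. The sketch below fits T by a simple grid search; the helper names (`softmax`, `nll`, `fit_temperature`) and the grid range are assumptions for illustration, and practical implementations usually use a gradient-based optimizer instead.

```python
import numpy as np

def softmax(logits):
    # Subtract the row-wise max for numerical stability.
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels, T):
    # Mean negative log-likelihood of the true labels at temperature T.
    probs = softmax(logits / T)
    return float(-np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12)))

def fit_temperature(logits, labels, grid=None):
    # Grid search over T (assumed range 1e-2 to 1e2); returns the NLL minimizer.
    grid = np.logspace(-2, 2, 401) if grid is None else grid
    losses = [nll(logits, labels, T) for T in grid]
    return float(grid[int(np.argmin(losses))])
```

Because dividing by a positive T never changes the predicted class, temperature scaling can only rescale confidence uniformly across inputs. This is why it may be unable to repair miscalibration that varies across regions where class supports overlap.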

Paper Structure

This paper contains 22 sections, 10 theorems, 30 equations, 9 figures, and 8 tables.

Key Result

Lemma 3.2

[Informal Optimality Lemma] Every $g^* \in \mathop{\mathrm{arginf}}\limits_g J_{\mathrm{mix}, d}(g, \mathcal{X}, \mathcal{D}_{\lambda, d})$ (where the $\mathop{\mathrm{arginf}}\limits$ is over all extended $\mathbb{R}^d$-valued functions) satisfies $\phi^y(g^*(z)) = \xi_y(z)/\sum_{s \in [k]} \xi_s(z

Figures (9)

  • Figure 1: Visualization of Definition \ref{simpledist} for the case $k = 4$.
  • Figure 2: Confidence histograms and reliability diagrams for ERM + TS and $4$-Mixup models on the test Gaussian data with $\mu = 0.01 * \mathbf{1}$, using 15 bins. Overall accuracy on the test data, as well as average confidence, are reported as dashed lines on the histograms.
  • Figure 3: Mean logit gap change as a function of distance away from the original training point.
  • Figure 4: Confidence histograms and reliability diagrams for ERM + TS and Mixup models on the test Gaussian data with $\mu = 0.25 * \mathbf{1}$, using 15 bins.
  • Figure 5: Confidence histograms and reliability diagrams for ERM + TS and Mixup models on the test Gaussian data with $\mu = 0.05 * \mathbf{1}$, using 15 bins.
  • ...and 4 more figures
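
The reliability diagrams in these figures group test predictions into 15 confidence bins and compare per-bin accuracy with per-bin confidence. A minimal sketch of the corresponding expected calibration error (ECE) follows. It assumes equal-width bins and top-label confidence, and the function name is hypothetical; the exact ECE and ACE variants used in the paper are not stated in this summary.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """Top-label ECE with equal-width confidence bins (illustrative sketch)."""
    conf = probs.max(axis=1)                 # confidence of the predicted class
    pred = probs.argmax(axis=1)
    correct = (pred == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Right-inclusive bins: bin b covers (edges[b], edges[b + 1]].
    bin_ids = np.digitize(conf, edges[1:-1], right=True)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - conf[mask].mean())
            ece += mask.mean() * gap         # weight by fraction of samples in bin
    return float(ece)
```

A perfectly calibrated model has zero gap in every bin; an overconfident model shows per-bin confidence above per-bin accuracy, which appears as bars below the diagonal in a reliability diagram.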

Theorems & Definitions (27)

  • Definition 3.1
  • Definition 3.2
  • Lemma 3.2
  • Definition 3.3
  • Remark 3.4
  • Definition 4.1
  • Proposition 4.1
  • Proposition 4.1
  • Definition 4.2
  • Theorem 4.2
  • ...and 17 more