Revisiting Reweighted Risk for Calibration: AURC, Focal, and Inverse Focal Loss

Han Zhou, Sebastian G. Gruber, Teodora Popordanoska, Matthew B. Blaschko

TL;DR

Miscalibrated neural networks misrepresent how reliable their predictions are, a serious problem in high-stakes applications. The paper builds a theoretical bridge between calibration error and selective classification, and introduces a differentiable selective-risk loss based on a bin-based CDF approximation that scales as $O(nK)$ and supports arbitrary confidence score functions. Empirically, the proposed AU loss is competitive with state-of-the-art trainable and reweighting calibration methods on CIFAR-10/100 and Tiny-ImageNet, and often yields the best class-wise calibration error (cwECE) together with balanced calibration behavior. The work offers a practical, scalable framework for improving calibration without sacrificing accuracy. Its main limitation is its dependence on the chosen confidence score function; better confidence estimation is left as future work.

Abstract

Several variants of reweighted risk functionals, such as focal loss, inverse focal loss, and the Area Under the Risk-Coverage Curve (AURC), have been proposed for improving model calibration, yet their theoretical connections to calibration errors remain unclear. In this paper, we revisit a broad class of weighted risk functions commonly used in deep learning and establish a principled connection between calibration error and selective classification. We show that minimizing calibration error is closely linked to the selective classification paradigm and demonstrate that optimizing selective risk in the low-confidence region naturally leads to improved calibration. This loss shares a similar reweighting strategy with dual focal loss but offers greater flexibility through the choice of confidence score functions (CSFs). Our approach uses a bin-based cumulative distribution function (CDF) approximation, enabling efficient gradient-based optimization without requiring expensive sorting and achieving $O(nK)$ complexity. Empirical evaluations demonstrate that our method achieves competitive calibration performance across a range of datasets and model architectures.
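
To make the sorting-free CDF idea concrete, the following is a minimal sketch of one possible bin-based approximation. It is not the paper's construction: the function name `soft_binned_cdf`, the Gaussian-style soft bin assignment, and the `temperature` parameter are all assumptions made for illustration. Each of the $n$ confidence scores is compared against $K$ bin centers, so the cost is $O(nK)$ and no sort is needed.

```python
import math

def soft_binned_cdf(conf, num_bins=10, temperature=0.01):
    """Approximate the empirical CDF of confidence scores in [0, 1].

    Hypothetical helper (an assumption, not the paper's algorithm):
    every score is softly assigned to K equal-width bins, and its CDF
    value is the membership-weighted cumulative bin mass.
    """
    n = len(conf)
    centers = [(k + 0.5) / num_bins for k in range(num_bins)]

    # memberships[i][k]: softmax over bins of -(c_i - center_k)^2 / temperature
    memberships = []
    for c in conf:
        logits = [-((c - mu) ** 2) / temperature for mu in centers]
        top = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(l - top) for l in logits]
        total = sum(exps)
        memberships.append([e / total for e in exps])

    # Average soft mass per bin, then its running sum (a K-point CDF).
    mass = [sum(m[k] for m in memberships) / n for k in range(num_bins)]
    cumulative, running = [], 0.0
    for p in mass:
        running += p
        cumulative.append(running)

    # CDF estimate per sample: expected cumulative mass under its memberships.
    return [sum(m[k] * cumulative[k] for k in range(num_bins)) for m in memberships]

scores = [0.1, 0.3, 0.5, 0.7, 0.9]
approx_cdf = soft_binned_cdf(scores)
```

Because every operation is a smooth function of the scores, the same computation written in an autodiff framework would let gradients flow through the CDF estimate, which is the property that hard sorting-based ranks lack.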

Paper Structure

This paper contains 26 sections, 6 theorems, 43 equations, 17 figures, 6 tables, 1 algorithm.

Key Result

Lemma 4.1

Let $f:\mathcal{X}\to\Delta^{k}$ be the classifier and $\widehat{\operatorname{err}}(f)$ be its empirical error rate on a finite set $\{(\boldsymbol{x}_i,\boldsymbol{y}_i)\}_{i=1}^{n}$, given by $\widehat{\operatorname{err}}(f)= \frac{1}{n}\sum_{i=1}^{n}\mathbb{I}\bigl(\hat{y}_i\neq y_i^\prime\bigr)$ ...

Figures (17)

  • Figure 1: Normalized Weights
  • Figure 1: Focal loss
  • Figure 2: Loss
  • Figure 4: CIFAR-10
  • Figure 4: Illustration of DFL behavior on misclassified samples where $p_{y'} < p_j$.
  • ...and 12 more figures

Theorems & Definitions (19)

  • Definition 3.1: Focal Loss
  • Definition 3.2: Inverse Focal Loss
  • Definition 3.3: AURC
  • Definition 3.4: Calibrated
  • Definition 3.5: $L_\rho$ Calibration Error ($\text{CE}_\rho$)
  • Definition 3.6: Binned estimator $\widehat{\text{ECE}}_{B}$
  • Definition 3.7: Class-Wise Calibration Error ($\text{cwECE}_\rho$)
  • Lemma 4.1: Lower bound of ECE (Ma2021a)
  • Proposition 4.2: Optimal Calibration Map for ECE (Ma2021a)
  • Proposition 4.3: Lower Bound of cwECE
  • ...and 9 more
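
Several of the definitions listed above have widely used standard forms, sketched below for orientation. This assumes the common textbook versions: focal loss $-(1-p_y)^\gamma \log p_y$, inverse focal loss $-(1+p_y)^\gamma \log p_y$, and an equal-width binned ECE estimator. The paper's Definitions 3.1, 3.2, and 3.6 may differ in notation or binning convention, and the function names here are illustrative.

```python
import math

def focal_loss(p_true, gamma=2.0):
    """Focal loss for one sample: down-weights confident (easy) samples."""
    return -((1.0 - p_true) ** gamma) * math.log(p_true)

def inverse_focal_loss(p_true, gamma=2.0):
    """Inverse focal loss for one sample: up-weights confident samples."""
    return -((1.0 + p_true) ** gamma) * math.log(p_true)

def binned_ece(confidences, correct, num_bins=10):
    """Equal-width binned ECE: sum_b (n_b / n) * |acc_b - conf_b|.

    Assumes left-closed bins [k/K, (k+1)/K), with a confidence of
    exactly 1.0 placed in the last bin.
    """
    n = len(confidences)
    conf_sum = [0.0] * num_bins
    acc_sum = [0.0] * num_bins
    for c, hit in zip(confidences, correct):
        k = min(int(c * num_bins), num_bins - 1)
        conf_sum[k] += c
        acc_sum[k] += float(hit)
    # (n_b / n) * |acc_b - conf_b| equals |acc_sum_b - conf_sum_b| / n
    return sum(abs(a - c) for a, c in zip(acc_sum, conf_sum)) / n

# Same predicted probability, opposite reweighting relative to cross-entropy.
p = 0.8
cross_entropy = -math.log(p)
losses = (focal_loss(p), cross_entropy, inverse_focal_loss(p))

# Four predictions at 0.75 confidence, three of them correct.
ece = binned_ece([0.75] * 4, [1, 1, 1, 0])
```

With $\gamma = 0$ both weighted losses reduce to plain cross-entropy, which makes the reweighting factor the only difference between the three objectives.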