Revisiting Reweighted Risk for Calibration: AURC, Focal, and Inverse Focal Loss
Han Zhou, Sebastian G. Gruber, Teodora Popordanoska, Matthew B. Blaschko
TL;DR
Poorly calibrated neural networks misrepresent how reliable their predictions are, a serious problem in high-stakes applications. The paper builds a theoretical bridge between calibration error and selective classification, and introduces a differentiable selective-risk loss based on a bin-based CDF approximation that scales as $O(nK)$ and supports arbitrary confidence score functions. Empirically, the proposed AU loss is competitive with state-of-the-art trainable and reweighting calibration methods on CIFAR-10/100 and Tiny-ImageNet, often yielding the best class-wise calibration error (cwECE) and balanced calibration behavior. The work offers a practical, scalable framework for improving calibration without sacrificing accuracy, with limitations tied to the choice of confidence score function and room for future gains from better confidence estimation.
Abstract
Several variants of reweighted risk functionals, such as focal loss, inverse focal loss, and the Area Under the Risk-Coverage Curve (AURC), have been proposed for improving model calibration, yet their theoretical connections to calibration errors remain unclear. In this paper, we revisit a broad class of weighted risk functions commonly used in deep learning and establish a principled connection between calibration error and selective classification. We show that minimizing calibration error is closely linked to the selective classification paradigm and demonstrate that optimizing selective risk in the low-confidence region naturally leads to improved calibration. The resulting selective-risk loss shares a similar reweighting strategy with dual focal loss but offers greater flexibility through the choice of confidence score functions (CSFs). Our approach uses a bin-based cumulative distribution function (CDF) approximation, which enables efficient gradient-based optimization without expensive sorting and achieves $O(nK)$ complexity. Empirical evaluations demonstrate that our method achieves competitive calibration performance across a range of datasets and model architectures.
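To make the idea of a bin-based CDF approximation concrete, the following is a minimal sketch and not the authors' implementation. It assumes confidence scores in $[0, 1]$, $K$ equal-width bins, and hard (non-differentiable) bin assignment; the function names `binned_cdf` and `exact_cdf` are hypothetical. It only illustrates how an $n \times K$ membership matrix can stand in for a sort when estimating each sample's CDF value.

```python
import numpy as np

def binned_cdf(scores, num_bins=10):
    """Approximate the empirical CDF of scores in [0, 1] with K equal-width bins.

    Illustrative assumption: hard binning. A trainable loss would need a
    differentiable treatment of this step.
    """
    scores = np.asarray(scores, dtype=float)
    # Bin index of each score; clip so a score of exactly 1.0 lands in the last bin.
    idx = np.clip((scores * num_bins).astype(int), 0, num_bins - 1)
    # n x K one-hot membership matrix: building and using it costs O(nK).
    member = np.eye(num_bins)[idx]
    # Fraction of samples per bin, then cumulative mass up to and including each bin.
    bin_cdf = np.cumsum(member.mean(axis=0))
    # Each sample reads off the cumulative mass of its own bin.
    return member @ bin_cdf

def exact_cdf(scores):
    """Sorting-based reference: fraction of samples with score <= each score."""
    scores = np.asarray(scores, dtype=float)
    sorted_scores = np.sort(scores)
    return np.searchsorted(sorted_scores, scores, side="right") / scores.size

scores = [0.05, 0.15, 0.15, 0.95]
approx = binned_cdf(scores, num_bins=10)
reference = exact_cdf(scores)
```

In this small example no bin contains two distinct score values, so the binned estimate coincides with the sorted reference; in general the approximation is only accurate up to the bin width, which is the trade-off for avoiding the sort.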
