Decoupling of neural network calibration measures

Dominik Werner Wolf; Prasannavenkatesh Balaji; Alexander Braun; Markus Ulrich

TL;DR

This work reveals a fundamental decoupling among common neural network calibration metrics, showing that optimizing one measure (e.g., ECE or UCE) does not guarantee optimality for others (e.g., CCQS/UCQS or AUSE-based criteria). It analyzes the relationships between calibration approaches, metrics, and uncertainty decomposition, using a UNET for semantic segmentation on the A2D2 dataset to demonstrate that temperature scaling can move calibration targets in different directions across metrics. By introducing and evaluating AUSE as a sparsification-based uncertainty measure and proposing AUSE$_{CE}$ as a residual-uncertainty estimator, the authors argue that a portion of calibration uncertainty remains irreducible under a fixed architecture, shaped by both aleatoric data randomness and the limited hypothesis space. The findings suggest that relying on a single calibration metric can yield ambiguous and potentially unsafe prediction intervals, highlighting the need for metric-aware calibration strategies in safety-critical applications and offering a pathway to disentangle model bias from irreducible uncertainty.

Abstract

A lot of effort is currently invested in safeguarding autonomous driving systems, which heavily rely on deep neural networks for computer vision. We investigate the coupling of different neural network calibration measures with a special focus on the Area Under the Sparsification Error curve (AUSE) metric. We elaborate on the well-known inconsistency in determining optimal calibration using the Expected Calibration Error (ECE) and we demonstrate similar issues for the AUSE, the Uncertainty Calibration Score (UCS), as well as the Uncertainty Calibration Error (UCE). We conclude that the current methodologies leave a degree of freedom, which prevents a unique model calibration for the homologation of safety-critical functionalities. Furthermore, we propose the AUSE as an indirect measure for the residual uncertainty, which is irreducible for a fixed network architecture and is driven by the stochasticity in the underlying data generation process (aleatoric contribution) as well as the limitation in the hypothesis space (epistemic contribution).
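The interplay between temperature scaling and the ECE mentioned above can be made concrete with a small sketch. It assumes the standard formulation: logits are divided by a temperature $T$ before the softmax, and the ECE is the sample-weighted mean of the absolute gap between accuracy and confidence over equal-width confidence bins. The function names, the bin count, and the toy data in the usage note are illustrative assumptions, not taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over one list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def expected_calibration_error(logits, labels, temperature=1.0, n_bins=10):
    """Binned ECE: sample-weighted mean of |accuracy - confidence| per bin."""
    # Per bin: [sample count, summed confidence, number of correct predictions]
    bins = [[0, 0.0, 0] for _ in range(n_bins)]
    for z, y in zip(logits, labels):
        probs = softmax(z, temperature)
        conf = max(probs)
        pred = probs.index(conf)
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes to the last bin
        bins[idx][0] += 1
        bins[idx][1] += conf
        bins[idx][2] += int(pred == y)
    n = len(labels)
    return sum(
        abs(correct / count - conf_sum / count) * count / n
        for count, conf_sum, correct in bins
        if count
    )
```

Evaluating `expected_calibration_error` over a grid of temperatures on held-out data gives one calibration loss curve. The decoupling described in the abstract corresponds to such curves, computed for different measures, reaching their minima at different temperatures.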

Paper Structure (16 sections, 10 equations, 4 figures, 1 table)

Introduction
Theory and related work
Calibration approaches
Calibration metrics
Expected Calibration Error:
Uncertainty Calibration Error:
Uncertainty Calibration Score:
Area Under Sparsification Error:
Uncertainty decomposition
Evaluation setup
Evaluation results
Impact of temperature scaling
Class-wise analysis
Average performance of the calibration measures
AUSE as an uncertainty estimator
...and 1 more sections

Figures (4)

Figure 1: The reliability diagrams for the untempered case ($T$ = 1.0, left) and for the optimal temperature (right) are illustrated along with the enclosed areas $A_{CC}$ and $A_{UC}$.
Figure 2: Calibration does not guarantee that the miscalibration in every bin is minimized since the bin-wise distribution of predictions is extremely skewed. This provides evidence on why the CCQS and UCQS for skewed data will not be coherently maximized as the ECE and the UCE are minimized.
Figure 3: Left: The normalized calibration loss surface of NLL, Brier score, the mean of class-wise ECE and the UCE are plotted. The optimal temperature for the Brier score, ECE and UCE coincide at $T$ = 0.4 while the NLL indicates a minimum at $T$ = 0.6. Right: The normalized calibration loss surface of AUSE$_V$ and AUSE$_S$ indicate minima at $T$=1.8 and $T$=0.9 respectively.
Figure 4: Only classes with an occurrence of $>10^{7}$ are considered. It is evident that the class-wise AUSE$_{CE}$ converges to the lowest value for the 400-epoch model.
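For readers unfamiliar with the sparsification-based measure referred to in the captions, the following is a minimal sketch of a generic AUSE computation. It assumes the common definition: samples are removed in order of decreasing predicted uncertainty, the mean error of the remaining samples is tracked, and the result is compared against an oracle that removes samples in order of decreasing true error. The paper's specific variants (AUSE$_V$, AUSE$_S$, AUSE$_{CE}$) are not reproduced here; the function names and the unnormalized trapezoidal integration are assumptions made for illustration.

```python
def sparsification_curve(errors, order):
    """Mean remaining error after removing the first k samples of `order`, k = 0..n-1."""
    n = len(errors)
    remaining = sum(errors)
    curve = []
    for k, i in enumerate(order):
        curve.append(remaining / (n - k))
        remaining -= errors[i]
    return curve


def ause(errors, uncertainties):
    """Area between the uncertainty-ranked and the oracle sparsification curves."""
    n = len(errors)
    by_uncertainty = sorted(range(n), key=lambda i: uncertainties[i], reverse=True)
    by_error = sorted(range(n), key=lambda i: errors[i], reverse=True)
    model = sparsification_curve(errors, by_uncertainty)
    oracle = sparsification_curve(errors, by_error)
    # Sparsification error at each removed fraction k/n
    gap = [m - o for m, o in zip(model, oracle)]
    # Trapezoidal rule with step 1/n over the removed fraction
    return sum((gap[k] + gap[k + 1]) / 2 for k in range(n - 1)) / n
```

An uncertainty estimate that ranks samples exactly as the true error does yields an AUSE of zero; any other ranking yields a positive value, because the oracle always removes the largest remaining error first.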
