How Flawed Is ECE? An Analysis via Logit Smoothing

Muthu Chidambaram, Holden Lee, Colin McSwiggen, Semon Rezchikov

TL;DR

A model is calibrated when its predicted probabilities match the observed frequencies. This work analyzes the canonical calibration measure $\mathrm{ECE}$, showing that it is lower semicontinuous on general probability spaces and characterizing its discontinuities, with a precise criterion in the general Polish-space setting. It introduces $\mathrm{LS}\text{-}\mathrm{ECE}$, a logit-smoothed, continuous analogue defined by $\mathrm{LS}\text{-}\mathrm{ECE}_{\pi,\xi}(h) = \mathbb{E}_{X,\xi}\left[\left|\mathbb{E}[Y \mid \rho(h(X)+\xi)] - \rho(h(X)+\xi)\right|\right]$, where $h$ outputs logits, $\rho$ maps logits to probabilities, and $\xi$ is noise added to the logits, and proves that it is continuous in $h$ under mild conditions. It also provides a consistent estimator based on kernel-like regression of $\mathbb{E}[Y \mid T=t]$, where $T=\rho(h(X)+\xi)$. Experiments on CIFAR-10/100 and ImageNet show that $\mathrm{ECE}$ and $\mathrm{LS}\text{-}\mathrm{ECE}$ take near-identical values across models and bin settings, suggesting that the pathologies of $\mathrm{ECE}$ may have little practical impact in real-world settings and that LS-ECE can serve as a sanity check on reported ECE values. Overall, the paper gives a general framework for understanding the discontinuities of ECE and offers an estimable alternative that agrees with standard practice on large-scale benchmarks.
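To make the LS-ECE definition concrete, here is a minimal Monte Carlo sketch in plain Python. It is not the authors' implementation (their Figure 1 uses PyTorch). It assumes binary labels, a sigmoid for $\rho$, and Gaussian noise $\xi \sim \mathcal{N}(0, \sigma^2)$; the function name `ls_ece` and the parameters `sigma`, `n_samples`, and `seed` are illustrative choices. For the empirical distribution, $\mathbb{E}[Y \mid h(X)+\xi = z]$ is a weighted average of the labels with weights given by the noise density at $z - h(x_j)$, which is the kernel-like regression referred to above. Because the sigmoid is invertible, conditioning on $T=\rho(z)$ is the same as conditioning on $z$.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function (the assumed choice of rho).
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def ls_ece(logits, labels, sigma=0.1, n_samples=2000, seed=0):
    """Monte Carlo sketch of LS-ECE for the empirical distribution.

    logits: list of real-valued logits h(x_i)
    labels: list of binary labels y_i in {0, 1}
    sigma:  standard deviation of the Gaussian logit noise (assumption)
    """
    rng = random.Random(seed)
    n = len(logits)
    total = 0.0
    for _ in range(n_samples):
        # Draw a smoothed logit z = h(x_i) + xi.
        i = rng.randrange(n)
        z = logits[i] + rng.gauss(0.0, sigma)
        # Gaussian log-weights of every data point at z, shifted by the
        # maximum so that exponentiation cannot underflow to all zeros.
        log_w = [-0.5 * ((z - l) / sigma) ** 2 for l in logits]
        m = max(log_w)
        w = [math.exp(v - m) for v in log_w]
        # Kernel-weighted label average estimates E[Y | h(X) + xi = z].
        cond_mean = sum(wj * yj for wj, yj in zip(w, labels)) / sum(w)
        total += abs(cond_mean - sigmoid(z))
    return total / n_samples


# Example: a confident predictor evaluated against labels it gets wrong.
example = ls_ece([5.0] * 20, [0] * 20, sigma=0.1, n_samples=200)
```

Each sample costs time linear in the dataset size, so this loop is only suitable for small inputs; a vectorized version using broadcasting would be the natural choice for benchmark-scale data.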

Abstract

Informally, a model is calibrated if its predictions are correct with a probability that matches the confidence of the prediction. By far the most common method in the literature for measuring calibration is the expected calibration error (ECE). Recent work, however, has pointed out drawbacks of ECE, such as the fact that it is discontinuous in the space of predictors. In this work, we ask: how fundamental are these issues, and what are their impacts on existing results? Towards this end, we completely characterize the discontinuities of ECE with respect to general probability measures on Polish spaces. We then use the nature of these discontinuities to motivate a novel continuous, easily estimated miscalibration metric, which we term Logit-Smoothed ECE (LS-ECE). By comparing the ECE and LS-ECE of pre-trained image classification models, we show in initial experiments that binned ECE closely tracks LS-ECE, indicating that the theoretical pathologies of ECE may be avoidable in practice.
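The discontinuity mentioned in the abstract can be seen in a standard two-point example, sketched below in Python. The example is an illustration under stated assumptions rather than a construction quoted from the paper: $X$ is uniform on two points, $Y$ is deterministic given $X$, and the helper name `exact_ece` is hypothetical. The constant predictor $1/2$ is perfectly calibrated, yet an arbitrarily small perturbation that separates the two points pushes ECE to nearly $1/2$.

```python
from collections import defaultdict


def exact_ece(px, g_star, h):
    """Exact ECE = E|E[Y | h(X)] - h(X)| for X on a finite support.

    px[i]:     P(X = i)
    g_star[i]: P(Y = 1 | X = i)
    h[i]:      predicted probability at X = i
    Points are grouped by exact equality of their predicted value.
    """
    mass = defaultdict(float)  # total probability at each predicted value
    pos = defaultdict(float)   # probability of Y = 1 at each predicted value
    for p, g, v in zip(px, g_star, h):
        mass[v] += p
        pos[v] += p * g
    return sum(m * abs(pos[v] / m - v) for v, m in mass.items() if m > 0)


px = [0.5, 0.5]
g_star = [0.0, 1.0]  # Y = 0 at the first point, Y = 1 at the second

# Constant predictor: E[Y | h = 1/2] = 1/2, so ECE is 0.
calibrated = exact_ece(px, g_star, [0.5, 0.5])

# Tiny perturbation: each point now has its own level set, and ECE
# becomes 1/2 - eps.
eps = 1e-6
perturbed = exact_ece(px, g_star, [0.5 - eps, 0.5 + eps])
```

As `eps` shrinks to zero, the perturbed predictor converges to the calibrated one while its ECE converges to $1/2$ rather than $0$, which is the kind of jump that logit smoothing is meant to remove.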

Paper Structure (18 sections, 17 theorems, 54 equations, 5 figures)

Key Result

Theorem 3.1

[Discontinuities for Discrete ECE] Let $\pi$ be any distribution such that $\mathrm{supp}(\pi_X) = [n]$ for an arbitrary positive integer $n$, and let $g^*(x) = P(Y = 1 \mid X = x)$ denote the ground truth conditional distribution. Then the set of discontinuities of $\mathrm{ECE}_{\pi}$ (in the spac

Figures (5)

  • Figure 1: Implementation of $\mathrm{LS}\text{-}\mathrm{ECE}_{\hat{\pi}, \xi}(h)$ in 10 lines of PyTorch using broadcast semantics.
  • Figure 2: Comparison of $\mathrm{ECE}_{\mathrm{BIN}, \pi}$ (blue) and $\mathrm{LS}\text{-}\mathrm{ECE}_{\pi, \xi}$ (orange) over bins (and correspondingly, inverse scalings for $\xi$) ranging from 1 to 100 on the model and data setup of Section \ref{sec:synthdata}.
  • Figure 3: Comparison of $\mathrm{ECE}_{\mathrm{BIN}, \pi}$ and $\mathrm{LS}\text{-}\mathrm{ECE}_{\pi, \xi}$ for different models on CIFAR datasets over bins/variance scalings ranging from 1 to 100. Solid lines correspond to $\mathrm{ECE}_{\mathrm{BIN}, \pi}$ and dashed lines correspond to $\mathrm{LS}\text{-}\mathrm{ECE}_{\pi, \xi}$.
  • Figure 4: Mean absolute difference between ECE and LS-ECE, as well as ECE and $\textsf{smECE}$, on ImageNet-1K-val over all models considered in Section \ref{sec:imagenet}, with one standard deviation error bounds marked using the shaded region.
  • Figure 5: Mean absolute difference between ECE and LS-ECE (using uniform noise instead of Gaussian), as well as ECE and $\textsf{smECE}$, on ImageNet-1K-val over all models considered in Section \ref{sec:imagenet}, with one standard deviation error bounds marked using the shaded region.

Theorems & Definitions (36)

  • Definition 3.0
  • Theorem 3.1
  • Lemma 3.1
  • proof
  • Corollary 3.1
  • proof
  • proof: Proof of Theorem \ref{discretecase}
  • Proposition 3.1
  • proof
  • Lemma 3.1
  • ...and 26 more