A Perfectly Truthful Calibration Measure

Jason Hartline, Lunjia Hu, Yifan Wu

TL;DR

Existing calibration measures are not truthful in the batch setting, which can mislead model evaluation. The paper introduces the averaged two-bin calibration error (ATB), a simple, perfectly truthful, sound, and complete calibration measure built from Unnormalized Binned Squared Errors (UBSEs) with a randomized two-bin boundary. ATB is shown to be quadratically related to the established measures smCal and distCal via a constant-factor approximation using $\ell_1$-ATB, and it yields a linear-time calibration tester, improving on a result of Hu et al. (2024). The framework also establishes further truthfulness properties, such as preserving Blackwell dominance and enabling robust recalibration results, and it highlights practical benefits such as continuity and robustness to hyperparameters. Together, these results give a theoretically principled, efficient, and robust approach to evaluating probabilistic forecasts in the batch setting, with implications for downstream decision-making and model calibration practice.
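The construction summarized above can be illustrated with a minimal sketch. The exact statement of Definition 1.2 is not reproduced on this page, so the sketch assumes the following form: draw a threshold α uniformly from [0, 1], split the sample into the bin of predictions at most α and the bin of predictions above α, take the squared sum of (prediction minus label) in each bin as that bin's UBSE, and average the total over α. The 1/n² scaling, the tie convention at the threshold, and the name `atb_sketch` are assumptions made for illustration, not the paper's notation.

```python
import numpy as np


def atb_sketch(preds, labels):
    """Averaged two-bin error under the ASSUMED definition described above.

    For a threshold a ~ Uniform[0, 1], the low bin holds predictions <= a and
    the high bin holds the rest. Each bin contributes the square of its summed
    residuals (prediction - label); the result is averaged over a.
    The 1/n**2 normalization is an assumption of this sketch.
    """
    preds = np.asarray(preds, dtype=float)
    labels = np.asarray(labels, dtype=float)
    n = len(preds)

    # Sort by prediction so that every threshold splits the sample into a prefix and a suffix.
    order = np.argsort(preds)
    sorted_preds = preds[order]
    residuals = (preds - labels)[order]

    # prefix[k] = summed residuals of the k smallest predictions (low bin with k points).
    prefix = np.concatenate(([0.0], np.cumsum(residuals)))
    total = prefix[-1]

    # The low bin holds exactly k points when a lies between the k-th and (k+1)-th
    # sorted predictions; widths[k] is the probability of that event under Uniform[0, 1].
    edges = np.concatenate(([0.0], sorted_preds, [1.0]))
    widths = np.diff(edges)

    # Sum of the two unnormalized binned squared errors for each possible split.
    two_bin_error = prefix ** 2 + (total - prefix) ** 2
    return float(np.sum(widths * two_bin_error) / n ** 2)


if __name__ == "__main__":
    print(atb_sketch([0.2, 0.8], [0, 1]))
```

Because the split only changes when α crosses a prediction, the average over α reduces to a weighted sum over the gaps between sorted predictions. This sketch sorts the sample first, so it is not the linear-time tester the paper describes. Under the assumed definition, the expected value for truthful predictions is the sum of the label variances divided by n², regardless of where the threshold falls, which matches the variance-additivity idea mentioned in the abstract.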

Abstract

Calibration requires that predictions are conditionally unbiased and, therefore, reliably interpretable as probabilities. A calibration measure quantifies how far a predictor is from perfect calibration. As introduced by Haghtalab et al. (2024), a calibration measure is truthful if it is minimized in expectation when a predictor outputs the ground-truth probabilities. Predicting the true probabilities guarantees perfect calibration, but in reality, when calibration is evaluated on a random sample, all known calibration measures incentivize predictors to lie in order to appear more calibrated. Such lack of truthfulness motivated Haghtalab et al. (2024) and Qiao and Zhao (2025) to construct approximately truthful calibration measures in the sequential prediction setting, but no perfectly truthful calibration measure was known to exist even in the more basic batch setting. We design a simple, perfectly and strictly truthful, sound and complete calibration measure in the batch setting: averaged two-bin calibration error (ATB). ATB is quadratically related to two existing calibration measures: the smooth calibration error smCal and the lower distance to calibration distCal. The simplicity in our definition of ATB makes it efficient and straightforward to compute, allowing us to give the first linear-time calibration testing algorithm, improving a result of Hu et al. (2024). We also introduce a general recipe for constructing truthful measures based on the variance additivity of independent random variables, which proves the truthfulness of ATB as a special case and allows us to construct other truthful calibration measures such as quantile-binned $\ell_2$-ECE.
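The claim that sample-based measures can reward lying can be seen in a small simulation. The following is a minimal sketch, not necessarily the paper's Example 1.1: it assumes an unbinned ECE that groups samples by distinct prediction value, true probabilities drawn uniformly from [0, 1], and a "lying" predictor that always reports 0.5. The function name `unbinned_ece` and the simulation setup are illustrative assumptions.

```python
import numpy as np


def unbinned_ece(preds, labels):
    """ECE with one bin per distinct prediction value (an assumed variant).

    Each distinct value v contributes (fraction of samples predicted v)
    times |v - mean label among those samples|.
    """
    preds = np.asarray(preds, dtype=float)
    labels = np.asarray(labels, dtype=float)
    n = len(preds)
    ece = 0.0
    for v in np.unique(preds):
        mask = preds == v
        ece += mask.sum() / n * abs(v - labels[mask].mean())
    return float(ece)


# Simulated batch: each sample has its own true probability (hypothetical setup).
rng = np.random.default_rng(0)
n = 10_000
p_true = rng.uniform(0.0, 1.0, size=n)
labels = (rng.uniform(0.0, 1.0, size=n) < p_true).astype(float)

# Truthful predictor: reports the ground-truth probability of every sample.
ece_truth = unbinned_ece(p_true, labels)

# Lying predictor: ignores the per-sample truth and always reports 0.5.
ece_lie = unbinned_ece(np.full(n, 0.5), labels)

if __name__ == "__main__":
    print(f"ECE of truthful predictions: {ece_truth:.3f}")
    print(f"ECE of constant 0.5 predictions: {ece_lie:.3f}")
```

In this setup the truthful predictor places almost every sample in its own bin, so its sample ECE is roughly the average of |p - y|, while the constant predictor is judged only on how close the overall label frequency is to 0.5. The truthful predictor therefore scores worse on the sample even though it reports the ground truth.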

Paper Structure

This paper contains 45 sections, 34 theorems, 144 equations, 3 figures, 1 table.

Key Result

Lemma 1.3

Figures (3)

  • Figure 1: Writing $w$ as a convex combination of threshold functions.
  • Figure 2: The order sensitivity of a truthful error metric. The large circle is an abstraction of the probabilistic space, with a realized state on a corner of the space. The reported prediction lies in the interior of the space. Fixing the realized state, the truthful error, as a function of the prediction, is increasing along the convex combination from the realized state to the reported prediction. For one binary state prediction, fixing the realized state, a truthful error is monotone in the distance between the reported prediction and the state.
  • Figure 3: Comparison of calibration measures with different numbers of bins. Each dot in the plot is a predictor. The $x$-axis plots the log loss (prediction quality), while the $y$-axis plots a calibration error. \ref{fig: intro binning size} replicates the result in minderer2021revisiting. The plots are adapted from lu2025making with permission.

Theorems & Definitions (82)

  • Example 1.1: ECE is not truthful, cf. sidestep
  • Definition 1.2: Averaged two-bin calibration error
  • Lemma 1.3: Informal, see \ref{lem: error decomposition}
  • Theorem 1.4: Informal, see \ref{thm: optimal validity}
  • Definition 1.5
  • Theorem 1.6: Informal, see \ref{cor:relationship}
  • Definition 2.1: Calibration
  • Definition 2.2: Calibration of prediction-state distributions
  • Definition 2.3: Expected Calibration Error (ECE) foster1997calibrated
  • Definition 2.4: (Lower) Distance to Calibration utc
  • ...and 72 more