Making and Evaluating Calibrated Forecasts

Yuxuan Lu, Yifan Wu, Jason Hartline, Lunjia Hu

TL;DR

This work addresses how to evaluate calibrated probabilistic predictions in multi-class settings with a truthful calibration measure. It generalizes perfectly truthful calibration from binary to multi-class tasks using classwise aggregation, proving truthfulness and a dominance-preserving property that is robust to the number of bins $m$. The authors provide both theoretical results and extensive empirical validation on CIFAR-100, showing that the truthful measure maintains consistent model rankings across binning choices and training checkpoints, unlike non-truthful alternatives. The approach advances principled calibration assessment for multi-class predictions and informs calibration-aware decision making in practice.

Abstract

Calibrated predictions can be reliably interpreted as probabilities. An important step towards achieving better calibration is to design an appropriate calibration measure to meaningfully assess the miscalibration level of a predictor. A recent line of work initiated by Haghtalab et al. [2024] studies the design of truthful calibration measures: a truthful measure is minimized when a predictor outputs the true probabilities, whereas a non-truthful measure incentivizes the predictor to lie so as to appear more calibrated. All previous calibration measures were non-truthful until Hartline et al. [2025] introduced the first perfectly truthful calibration measures for binary prediction tasks in the batch setting. We introduce a perfectly truthful calibration measure for multi-class prediction tasks, generalizing the work of Hartline et al. [2025] beyond binary prediction. We study common methods of extending calibration measures from binary to multi-class prediction and identify ones that do or do not preserve truthfulness. In addition to truthfulness, we mathematically prove and empirically verify that our calibration measure exhibits superior robustness: it robustly preserves the ordering between dominant and dominated predictors, regardless of the choice of hyperparameters (bin sizes). This result addresses the non-robustness issue of binned ECE, which has been observed repeatedly in prior work.

Paper Structure

This paper contains 39 sections, 6 theorems, 40 equations, 10 figures, 1 table.

Key Result

Theorem 2.6

In binary prediction, for every choice of the hyperparameter $m$, the calibration measure $\ell_2\textup{-}\textsc{qECE}_m$ is truthful. Moreover, for every $p_1^*,\ldots,p_n^*\in [0,1]$, assuming $y_i\in \{0,1\}$ is drawn from $\mathsf{Ber}(p_i^*)$ independently for every $i = 1,\ldots,n$, the expe
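To make the role of the hyperparameter $m$ concrete, the following is a minimal sketch of a quantile-binned calibration error, assuming a generic construction: sort the predictions, split them into $m$ equal-sized bins, and aggregate the weighted $\ell_p$ gap between the mean prediction and the mean outcome in each bin. The function names `quantile_binned_error` and `classwise_error`, the weighting by bin size, and the averaging over classes are all assumptions made for illustration; the exact definition and normalization of $\ell_2\textup{-}\textsc{qECE}_m$ and of class-wise aggregation in the paper may differ.

```python
import numpy as np


def quantile_binned_error(preds, labels, m, power=2):
    """Generic quantile-binned calibration error (illustrative, not the paper's exact measure).

    preds  : 1-D array of predicted probabilities in [0, 1]
    labels : 1-D array of binary outcomes in {0, 1}
    m      : number of equal-sized (quantile) bins
    power  : exponent applied to each bin's gap (2 gives a squared, l2-style gap)
    """
    preds = np.asarray(preds, dtype=float)
    labels = np.asarray(labels, dtype=float)
    if preds.ndim != 1 or preds.shape != labels.shape:
        raise ValueError("preds and labels must be 1-D arrays of equal length")
    if not 1 <= m <= preds.size:
        raise ValueError("m must be between 1 and the number of samples")

    # Sort by prediction and cut the sorted indices into m near-equal bins.
    order = np.argsort(preds, kind="stable")
    bins = np.array_split(order, m)

    n = preds.size
    total = 0.0
    for idx in bins:
        # Gap between average prediction and empirical frequency in this bin.
        gap = abs(preds[idx].mean() - labels[idx].mean())
        total += (idx.size / n) * gap ** power
    return float(total)


def classwise_error(prob_matrix, class_labels, m, power=2):
    """Apply the binary measure to each class column and average (assumed aggregation)."""
    prob_matrix = np.asarray(prob_matrix, dtype=float)
    class_labels = np.asarray(class_labels)
    k = prob_matrix.shape[1]
    per_class = [
        quantile_binned_error(
            prob_matrix[:, j], (class_labels == j).astype(float), m, power
        )
        for j in range(k)
    ]
    return float(np.mean(per_class))
```

Calling `quantile_binned_error(preds, labels, m)` for several values of `m` on the same predictor is one way to observe how sensitive a binned measure is to the bin count, which is the robustness question the paper studies.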

Figures (10)

  • Figure 1: We compare each calibration measure across different numbers of bins. Each dot in the plot is a predictor. The $x$-axis plots the log loss, while the $y$-axis plots a calibration error. The panel with label "fig: intro binning size" replicates the result of Minderer et al. [2021].
  • Figure 2: Calibration error and proper losses of different checkpoints of MobileNetV3 on the test set. Each dot in the plot corresponds to one checkpoint. The $x$-axis of each plot is the log loss. The $y$-axis shows a different calibration error / proper loss.
  • Figure 3: Calibration errors and log loss of checkpoints for all models we evaluated. Each dot in the plot corresponds to a checkpoint of one neural network model. The $x$-axis of each plot is the log loss. The $y$-axis shows different calibration errors. We plot in colors the models in the maximal dominant total order and plot the rest of the models in grey.
  • Figure 4: Calibration errors and log loss of checkpoints for all models we evaluated. Each dot in the plot corresponds to a checkpoint of one neural network model. The $x$-axis of each plot is the log loss. The $y$-axis shows different calibration errors. We plot in colors the models in the maximal dominant total order and plot the rest of the models in grey.
  • Figure 5: Proper losses and calibration errors of the MobileNetV3-Small model.
  • ...and 5 more figures

Theorems & Definitions (27)

  • Example 1.1
  • Definition 2.1: Calibration for $k$-class prediction
  • Definition 2.2: Calibration for binary prediction
  • Definition 2.3: Truthfulness
  • Definition 2.4: $\textsc{ECE}$
  • Definition 2.5: Quantile-binned ECE
  • Theorem 2.6: Hartline et al. [2025]
  • Definition 2.7: Class-wise aggregation
  • Definition 2.8: Confidence aggregation
  • Definition 2.9: Dominance
  • ...and 17 more