Information-theoretic Generalization Analysis for Expected Calibration Error

Futoshi Futami, Masahiro Fujisawa

TL;DR

This work addresses the limited understanding of the estimation bias of binning-based calibration errors by providing a unified treatment of both uniform width binning (UWB) and uniform mass binning (UMB). It derives sharp upper bounds on the total bias of the binned ECE, identifies the bin count B = O(n_te^{1/3}) that minimizes this bias, and shows that the resulting bias scales as O(n_te^{-1/3}). Extending to generalization analysis, the authors develop information-theoretic bounds on the ECE and TCE gaps via eCMI/fCMI, relate these to metric entropy, and analyze the impact of data reuse in recalibration. Experiments on synthetic and real datasets confirm that the bounds are nonvacuous and illustrate practical guidance for choosing the number of bins, including the potential benefit of reusing training data for recalibration when calibration generalizes well.
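
As a rough illustration of how a cube-root bin count and an $n_{\mathrm{te}}^{-1/3}$ rate can arise together, suppose the total bias were bounded by a term that shrinks with finer bins plus a term that grows with the number of bins. The functional form $g(B)$ and the constants $c_1, c_2 > 0$ below are assumptions made for this sketch only; they are not the paper's stated bound.

$$
g(B) = \frac{c_1}{B} + c_2\sqrt{\frac{B}{n_{\mathrm{te}}}}, \qquad
g'(B) = -\frac{c_1}{B^2} + \frac{c_2}{2\sqrt{n_{\mathrm{te}} B}} = 0
\;\Longrightarrow\;
B^{\star} = \left(\frac{2c_1}{c_2}\right)^{2/3} n_{\mathrm{te}}^{1/3}.
$$

Substituting $B^{\star}$ back gives $c_1/B^{\star} = O(n_{\mathrm{te}}^{-1/3})$ and $c_2\sqrt{B^{\star}/n_{\mathrm{te}}} = O(n_{\mathrm{te}}^{-1/3})$, so under this assumed form both terms balance at the rate quoted above.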

Abstract

While the expected calibration error (ECE), which employs binning, is widely adopted to evaluate the calibration performance of machine learning models, theoretical understanding of its estimation bias is limited. In this paper, we present the first comprehensive analysis of the estimation bias in the two common binning strategies, uniform mass and uniform width binning. Our analysis establishes upper bounds on the bias, achieving an improved convergence rate. Moreover, our bounds reveal, for the first time, the optimal number of bins to minimize the estimation bias. We further extend our bias analysis to generalization error analysis based on the information-theoretic approach, deriving upper bounds that enable the numerical evaluation of how small the ECE is for unknown data. Experiments using deep learning models show that our bounds are nonvacuous thanks to this information-theoretic generalization analysis approach.
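
The two binning strategies and the cube-root bin count can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes binary labels in {0, 1} and scores interpreted as predicted probabilities of label 1, and the function names `binned_ece` and `suggested_bins` are hypothetical.

```python
import numpy as np


def suggested_bins(n):
    """Return floor(n ** (1/3)), corrected for floating-point error, and at least 1."""
    b = int(round(n ** (1.0 / 3.0)))
    while b ** 3 > n:          # rounding overshot the true floor
        b -= 1
    while (b + 1) ** 3 <= n:   # rounding undershot the true floor
        b += 1
    return max(1, b)


def binned_ece(conf, labels, n_bins, strategy="uwb"):
    """Binned ECE: sum over bins of (n_b / n) * |mean label - mean score|."""
    conf = np.asarray(conf, dtype=float)
    labels = np.asarray(labels, dtype=float)
    n = conf.size
    grid = np.linspace(0.0, 1.0, n_bins + 1)
    if strategy == "uwb":
        # Uniform width: equal-length intervals on [0, 1].
        edges = grid
    elif strategy == "umb":
        # Uniform mass: edges at empirical quantiles, so bins hold roughly n / B points.
        edges = np.quantile(conf, grid)
    else:
        raise ValueError("strategy must be 'uwb' or 'umb'")
    # Using interior edges only maps every score to a bin index in 0..n_bins-1.
    idx = np.digitize(conf, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            ece += mask.sum() / n * abs(labels[mask].mean() - conf[mask].mean())
    return ece


# Usage on synthetic, perfectly calibrated scores (seeded for reproducibility).
rng = np.random.default_rng(0)
scores = rng.uniform(size=2000)
y = (rng.uniform(size=2000) < scores).astype(float)
B = suggested_bins(scores.size)
ece_uwb = binned_ece(scores, y, B, "uwb")
ece_umb = binned_ece(scores, y, B, "umb")
```

Because the synthetic scores are calibrated by construction, any nonzero value of `ece_uwb` or `ece_umb` reflects estimation error of the binned estimator rather than miscalibration, which is the quantity the bias analysis is concerned with.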

Paper Structure

This paper contains 61 sections, 17 theorems, 162 equations, 6 figures, and 4 tables.

Key Result

Theorem 1

Under the CMI setting, the bound is stated in terms of the evaluated CMI $\mathrm{eCMI}(l)\coloneqq I(l(\mathcal{A}(\tilde{Z}_U,R),\tilde{Z});U|\tilde{Z})$, where $l(\mathcal{A}(\tilde{Z}_U,R),\tilde{Z})$ is the $n \times 2$ loss matrix obtained by applying $l(\mathcal{A}(\tilde{Z}_U,R),\cdot)$ elementwise to $\tilde{Z}$.

Figures (6)

  • Figure 1: Behavior of the upper bound in Eq. \ref{eq_test_data_use_total_bias} as $n$ increases when UWB is used. The labels "less calibrate" and "better calibrate" refer to $\beta = (0.5, -1.5)$ and $\beta = (0.2, -1.9)$, respectively; the former setting produces a worse value of the TCE estimator.
  • Figure 2: Behavior of the upper bound in Eq. \ref{eq:bias_bound} for various $B$ as $n$ increases (mean $\pm$ std.). For clarity, only the results using UMB are shown. The ECE gap is shown only for $B = \lfloor n^{1/3} \rfloor$ because varying $B$ did not change it significantly. See Figure \ref{fig:boundplot_logscale} in Appendix \ref{app:bound_plot_various} for a detailed analysis of the relationship between (log-scaled) ECE gap values and bound values across different bin settings.
  • Figure 3: Behavior of the upper bound in Eq. \ref{eq:bias_bound} for various $B$ as $n$ increases (mean $\pm$ std.). For clarity, only the results using UWB are shown. The ECE gap is evaluated by estimating $\mathbb{E}_{R,S_{\mathrm{tr}},S_{\mathrm{te}}}[|\mathrm{ECE}(f_W,S_{\mathrm{te}})-\mathrm{ECE}(f_W, S_{\mathrm{tr}})|]$. The ECE gap is shown only for $B = \lfloor n^{1/3} \rfloor$ because varying $B$ did not change it significantly.
  • Figure 4: Behavior of the upper bound in Eq. \ref{eq:tight_bound_thm7} as $n$ increases for different numbers of bins (mean $\pm$ std.) when using UMB after recalibration.
  • Figure 5: Behavior of the upper bound in Eq. \ref{eq:bias_bound} for various $B$ as $n$ increases (mean $\pm$ std.; log-scale) when UMB is used. The ECE gap is evaluated by estimating $\mathbb{E}_{R,S_{\mathrm{tr}},S_{\mathrm{te}}}[|\mathrm{ECE}(f_W,S_{\mathrm{te}})-\mathrm{ECE}(f_W, S_{\mathrm{tr}})|]$. These results show that the variance of the ECE gap is large under non-optimal $B$ settings, while the ECE gap under the optimal $B$ is stable.
  • ...and 1 more figure

Theorems & Definitions (33)

  • Theorem 1: Theorem 6.7 in steinke20a
  • Theorem 2: Statistical bias analysis
  • proof : Proof sketch
  • Theorem 3: Binning bias analysis
  • proof : Proof sketch
  • Corollary 1
  • Theorem 4: Generalization error bound of the ECE
  • proof : Proof sketch
  • Theorem 5: Generalization error bound of the TCE
  • Theorem 6: Metric entropy
  • ...and 23 more