Extending confidence calibration to generalised measures of variation

Andrew Thompson, Vivek Desai

TL;DR

The paper extends the calibration assessment of multiclass classifiers beyond confidence by introducing the Variation Calibration Error (VCE), a framework for assessing the calibration of any variation measure $\mathcal{V}$. It generalises the conventional Expected Calibration Error (ECE) by ranking predicted probabilities in reverse order and comparing the aggregated $\mathcal{V}$-variations of predictions with the observed rankings of the true class across bins. It shows that ECE is the special case in which the variation measure is confidence, and demonstrates that using entropy as $\mathcal{V}$ yields more intuitive calibration behavior than the previously proposed Uncertainty Calibration Error (UCE). Empirical results on synthetic, perfectly calibrated predictions indicate that VCE and ECE converge to zero as the sample size grows, while UCE plateaus at a nonzero value, supporting VCE as a robust calibration diagnostic with potential for broader use.

Abstract

We propose the Variation Calibration Error (VCE) metric for assessing the calibration of machine learning classifiers. The metric can be viewed as an extension of the well-known Expected Calibration Error (ECE), which assesses the calibration of the maximum probability, or confidence. Other measures of the variation of a probability distribution exist, such as the Shannon entropy, and these have the advantage of taking the full probability distribution into account. We show how the ECE approach can be extended from assessing confidence calibration to assessing the calibration of any metric of variation. We present numerical examples on synthetic predictions that are perfectly calibrated by design, demonstrating that, in this scenario, the VCE has the desired property of approaching zero as the number of data samples increases, in contrast to another entropy-based calibration metric (the UCE) that has been proposed in the literature.
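
To make the baseline concrete, here is a minimal sketch of the standard equal-width ECE, evaluated on synthetic predictions that are perfectly calibrated by construction. It is not the authors' code and does not implement the VCE. The sampling scheme (predictions drawn from a Dirichlet distribution, each label drawn from its own predicted distribution), the function names, the choice of 10 bins, and the seed are all assumptions made for illustration.

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """Equal-width ECE: bin-weighted mean of |accuracy - mean confidence|."""
    confidence = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    # Map confidence in [0, 1] to a bin index; confidence == 1.0 goes in the last bin.
    bin_idx = np.minimum((confidence * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_idx == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidence[mask].mean())
            ece += mask.mean() * gap  # mask.mean() is the fraction of samples in bin b
    return ece

def sample_calibrated(n, alpha, rng):
    """Assumed setup: predictions ~ Dirichlet(alpha), label ~ Categorical(prediction)."""
    probs = rng.dirichlet(alpha, size=n)
    # Inverse-CDF sampling of one label per row.
    cdf = probs.cumsum(axis=1)
    u = rng.random(n)
    labels = (u[:, None] > cdf).sum(axis=1)
    # Guard against rounding in the cumulative sum pushing an index out of range.
    labels = np.minimum(labels, probs.shape[1] - 1)
    return probs, labels

rng = np.random.default_rng(0)
for n in (10**3, 10**4, 10**5):
    probs, labels = sample_calibrated(n, alpha=[1.0, 1.0, 1.0], rng=rng)
    print(n, expected_calibration_error(probs, labels))
```

Because each label is drawn from the predicted distribution itself, the top-class accuracy within a bin matches the mean confidence in expectation, so the printed ECE values should shrink toward zero as `n` grows. What remains is finite-sample noise.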

Paper Structure (8 sections, 14 equations, 3 figures)


Figures (3)

  • Figure 1: Metric results for the equal-width binning strategy for the VCE, ECE, and UCE metrics. Results are shown for the $3$-class (top) and $10$-class (bottom) classification problems, with two different sets of $\alpha$ parameters (equally weighted (left) and heavily skewed (right)), across four different numbers of samples.
  • Figure 2: Reliability diagrams for the ECE (left), UCE (middle) and VCE (right) for an example experiment, with $C=3$, equally-weighted $\alpha$ parameters and $N=10^{7}$. We adopt an equal-width binning strategy for these results. The black dashed line indicates perfect calibration. Respective metric values are shown in the legend of each plot.
  • Figure 3: Metric results for the equal-frequency binning strategy for the VCE, ECE, and UCE metrics. Results are shown for the $3$-class (top) and $10$-class (bottom) classification problems, with two different sets of $\alpha$ parameters (equally weighted (left) and heavily skewed (right)), across four different numbers of samples.
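
The figure captions contrast equal-width and equal-frequency binning. The sketch below shows one common way to compute bin assignments under each strategy. The helper names and the rank-based equal-frequency scheme are illustrative assumptions, not taken from the paper, and the Beta-distributed scores are only a stand-in for a skewed set of confidences.

```python
import numpy as np

def equal_width_bins(values, n_bins, lo=0.0, hi=1.0):
    """Assign each value to one of n_bins intervals of equal width on [lo, hi]."""
    edges = np.linspace(lo, hi, n_bins + 1)
    # Digitizing against the interior edges yields indices 0 .. n_bins - 1.
    return np.digitize(values, edges[1:-1])

def equal_frequency_bins(values, n_bins):
    """Assign values to n_bins bins holding (nearly) equal numbers of samples."""
    n = len(values)
    order = np.argsort(values, kind="stable")
    idx = np.empty(n, dtype=int)
    # The k-th smallest value goes to bin floor(k * n_bins / n).
    idx[order] = np.arange(n) * n_bins // n
    return idx

rng = np.random.default_rng(0)
scores = rng.beta(5.0, 1.0, size=1000)  # hypothetical scores skewed toward 1

print("equal-width counts:    ", np.bincount(equal_width_bins(scores, 10), minlength=10))
print("equal-frequency counts:", np.bincount(equal_frequency_bins(scores, 10), minlength=10))
```

With skewed scores, equal-width binning leaves the low bins nearly empty, so their per-bin estimates are noisy. Equal-frequency binning keeps the sample count per bin constant at the cost of bins with unequal widths.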