All Models Are Miscalibrated, But Some Less So: Comparing Calibration with Conditional Mean Operators

Peter Moskvichev, Dino Sejdinovic

TL;DR

This paper tackles the problem of comparing calibration across probabilistic classifiers in high-risk applications. It introduces CKCE, a kernel-based measure grounded in conditional mean operators that quantifies the difference between conditional distributions while reducing sensitivity to the marginal distribution of predictions. In synthetic and real-data experiments (including ImageNet), CKCE yields more consistent model rankings under distribution shift, including covariate shift, than existing metrics such as ECE and JKCE. The findings suggest CKCE as a robust tool for relative calibration assessment and as a potential regularization target in model training.

Abstract

When working in a high-risk setting, having well-calibrated probabilistic predictive models is a crucial requirement. However, estimators for calibration error are not always able to correctly distinguish which model is better calibrated. We propose the conditional kernel calibration error (CKCE), which is based on the Hilbert-Schmidt norm of the difference between conditional mean operators. By working directly with the definition of strong calibration as the distance between conditional distributions, which we represent by their embeddings in reproducing kernel Hilbert spaces, the CKCE is less sensitive to the marginal distribution of predictive models. This makes it more effective for relative comparisons than previously proposed calibration metrics. Our experiments, using both synthetic and real data, show that CKCE provides a more consistent ranking of models by their calibration error and is more robust against distribution shift.
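
To make the idea of "the Hilbert-Schmidt norm of the difference between conditional mean operators" concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not necessarily the estimator used in the paper: it assumes a Gaussian kernel on the predicted probability vectors, a Kronecker delta kernel on the labels (so a label embeds as its one-hot vector and a predicted distribution embeds as the probability vector itself), and the standard ridge-regularised empirical conditional mean operator. The names `gaussian_gram` and `operator_gap_sq` and the defaults `reg` and `bandwidth` are hypothetical choices made for this example.

```python
import numpy as np


def gaussian_gram(points, bandwidth=1.0):
    """Gram matrix of a Gaussian kernel between the rows of `points`."""
    sq_norms = np.sum(points ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * points @ points.T
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth ** 2))


def operator_gap_sq(probs, labels, reg=0.1, bandwidth=1.0):
    """Squared HS norm of the gap between two empirical conditional mean operators.

    probs:  (n, C) array of predicted class probabilities.
    labels: (n,) integer array of observed classes in {0, ..., C-1}.

    Both operators are conditioned on the prediction. One targets the observed
    labels, the other targets the predicted distribution. Assumption: with a
    delta kernel on labels, their targets differ by `onehot - probs`.
    """
    n, num_classes = probs.shape
    onehot = np.eye(num_classes)[labels]
    resid = onehot - probs                      # difference of label-space embeddings
    gram = gaussian_gram(probs, bandwidth)      # kernel matrix on predictions
    # Ridge-regularised solve: weights = (K + n * reg * I)^{-1} resid
    weights = np.linalg.solve(gram + n * reg * np.eye(n), resid)
    # ||D W Phi^T||_HS^2 = trace(weights^T K weights)
    return float(np.sum(weights * (gram @ weights)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_p = rng.dirichlet(np.ones(3), size=400)
    u = rng.random(400)
    # Inverse-CDF sampling of one label per row of true_p.
    y = np.minimum((u[:, None] > np.cumsum(true_p, axis=1)).sum(axis=1), 2)
    biased = 0.2 * true_p + 0.8 * np.eye(3)[0]  # model that over-predicts class 0
    print("calibrated:", operator_gap_sq(true_p, y))
    print("biased:    ", operator_gap_sq(biased, y))
```

Because the quantity is built from the conditional residual rather than from binned confidences, a model whose predictions match the conditional label distribution should score close to zero regardless of how its predictions are spread over the simplex; the regularisation strength and kernel bandwidth remain tuning choices in this sketch.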

Paper Structure

This paper contains 22 sections, 12 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Reliability diagram and distribution of prediction confidence for two models trained on ImageNet. A lower CKCE value indicates a closer match between model confidence and accuracy, whereas ECE is heavily affected by the marginal distribution of predictions.
  • Figure 2: The CKCE estimator remains stable under covariate shift, whereas JKCE and ECE are highly sensitive to changes in the input distribution.
  • Figure 3: Calibration error of models with changing image brightness. CKCE provides consistent model preference, unlike JKCE.
  • Figure 4: CKCE of a probabilistic model for three choices of kernel on the input variable. Using a kernel that combines a linear and a Gaussian component provides a more robust measure of calibration error.

Theorems & Definitions (1)

  • Definition 1