Optimizing Estimators of Squared Calibration Errors in Classification

Sebastian G. Gruber, Francis Bach

TL;DR

This work tackles the problem of selecting estimators for squared calibration errors in classification by introducing a mean-squared error risk framework that facilitates principled comparison and optimization of calibration estimators. It formalizes a calibration estimation risk, connects it to canonical calibration via $\operatorname{CCE}_2$, and recasts existing estimators as calibration estimation functions, while also proposing two kernel ridge regression-based estimators with closed-form solutions. A training–validation–testing pipeline enables hyperparameter tuning and unbiased evaluation of calibration estimators on finite data, demonstrated through simulations and real-world benchmarks like CIFAR-10/100 and ImageNet. The findings show no single estimator dominates across settings, underscoring the need for risk-guided estimator selection and highlighting practical gains from kernel-based approaches in calibration estimation.

Abstract

In this work, we propose a mean-squared error-based risk that enables the comparison and optimization of estimators of squared calibration errors in practical settings. Improving the calibration of classifiers is crucial for enhancing the trustworthiness and interpretability of machine learning models, especially in sensitive decision-making scenarios. Although various calibration (error) estimators exist in the current literature, there is a lack of guidance on selecting the appropriate estimator and tuning its hyperparameters. By leveraging the bilinear structure of squared calibration errors, we reformulate calibration estimation as a regression problem with independent and identically distributed (i.i.d.) input pairs. This reformulation allows us to quantify the performance of different estimators even for the most challenging calibration criterion, known as canonical calibration. Our approach advocates for a training-validation-testing pipeline when estimating a calibration error on an evaluation dataset. We demonstrate the effectiveness of our pipeline by optimizing existing calibration estimators and comparing them with novel kernel ridge regression-based estimators on standard image classification tasks.
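
The regression view described in the abstract can be illustrated with a small sketch. The excerpt does not state the risk formula, so the pairwise target $\langle y_i - f(x_i),\, y_j - f(x_j) \rangle$ for $i \neq j$, the function names `pair_targets` and `empirical_pair_risk`, and the synthetic data are all assumptions made for illustration, not the authors' definitions. The sketch scores a candidate function $h$ on pairs of predictions by its mean-squared error against these targets.

```python
import numpy as np

def pair_targets(probs, onehot):
    """Assumed regression targets: <y_i - f_i, y_j - f_j> for every pair (i, j)."""
    resid = onehot - probs              # residuals, shape (n, d)
    return resid @ resid.T              # Gram matrix of residuals, shape (n, n)

def empirical_pair_risk(h_values, probs, onehot):
    """Mean-squared error between h(f_i, f_j) and the pair targets over i != j."""
    n = probs.shape[0]
    off_diag = ~np.eye(n, dtype=bool)   # exclude pairs with i == j
    targets = pair_targets(probs, onehot)
    return float(np.mean((h_values[off_diag] - targets[off_diag]) ** 2))

# Synthetic, perfectly calibrated predictions: labels are drawn from probs.
rng = np.random.default_rng(0)
n, d = 200, 5
probs = rng.dirichlet(np.ones(d), size=n)
labels = np.array([rng.choice(d, p=p) for p in probs])
onehot = np.eye(d)[labels]

# Two hypothetical candidates: h = 0 everywhere and h = 0.5 everywhere.
risk_zero = empirical_pair_risk(np.zeros((n, n)), probs, onehot)
risk_const = empirical_pair_risk(np.full((n, n), 0.5), probs, onehot)
print(risk_zero < risk_const)
```

Because the labels here are sampled from the predicted probabilities, the residuals of distinct instances are independent with mean zero, so the constant-zero candidate should attain the lower risk. In a training-validation-testing pipeline, one would presumably fit each candidate on the training split and compare such a risk on the validation split.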

Paper Structure

This paper contains 25 sections, 2 theorems, 50 equations, 7 figures, 6 tables, 1 algorithm.

Key Result

Theorem 1

For any $h \colon \Delta^d \times \Delta^d \to \mathbb{R}$ for which $h = h^*$ does not hold $\mathbb{P}_{f(X)} \otimes \mathbb{P}_{f(X)}$-almost surely, we have that

Figures (7)

  • Figure 1: Simulated experiment for estimating $\operatorname{CCE}_2$ in a task with 5 classes and 500 instances. The empirical risk correctly identifies the ideal calibration estimator with $\theta=1$, indicated by the red line. Standard deviations across multiple seeds indicate the stability of the empirical risk.
  • Figure 2: Different $\operatorname{TCE}_2$ estimates of different models. Most calibration estimates approximately agree with each other. Only $\operatorname{TCE}_2^{\operatorname{kde}}$ is an outlier for DenseNet-40, ResNetWide-32, and ResNet-110 SD. However, it also shows an increased calibration estimation risk in these cases (cf. Table \ref{tbl:tce_cif10}).
  • Figure 3: Average runtime with error bars of the estimators in Figure \ref{fig:tce_cif10} on a single CPU thread. Optimizing the hyperparameters has a substantial computational cost.
  • Figure 4: Different $\operatorname{CCE}_2$ estimates for CIFAR-100 models. The risk values of Table \ref{tbl:cce_cif100} do not relate to the calibration estimate but only indicate which estimator to trust more (here: $\operatorname{CCE}_2^{\operatorname{kde}}$).
  • Figure 5: Different calibration estimates of different models. Most calibration estimates approximately agree with each other. This is in agreement with the similar risk values for each estimator in Table \ref{tbl:tce_imgnet} and Table \ref{tbl:cce_cif10}.
  • ...and 2 more figures

Theorems & Definitions (8)

  • Definition 1
  • Definition 2
  • Theorem 1
  • Theorem 2
  • Remark
  • Remark
  • Proof
  • Remark