ESD: Expected Squared Difference as a Tuning-Free Trainable Calibration Measure

Hee Suk Yoon, Joshua Tian Jin Tee, Eunseop Yoon, Sunjae Yoon, Gwangsu Kim, Yingzhen Li, Chang D. Yoo

TL;DR

This work tackles neural network miscalibration and the overhead of tuning calibration losses by introducing Expected Squared Difference (ESD), a tuning-free, binning-free calibration objective. ESD measures calibration error as the squared difference between two expectations and admits an unbiased, consistent estimator, so it can be trained alongside NLL without internal hyperparameters. Across CNN and Transformer architectures on vision and NLP tasks, ESD achieves better calibration (lower ECE) with only a modest loss in accuracy, and its hyperparameter-free nature yields substantial computational savings, especially as model and dataset sizes grow. Interleaved training further aids robustness to distribution shifts, and post-processing (temperature or vector scaling) further improves calibration.

Abstract

Studies have shown that modern neural networks tend to be poorly calibrated due to over-confident predictions. Traditionally, post-processing methods have been used to calibrate the model after training. In recent years, various trainable calibration measures have been proposed that can be incorporated directly into the training process. However, these methods all have internal hyperparameters, and the performance of these calibration objectives relies on tuning them, which incurs growing computational costs as neural networks and datasets become larger. As such, we present Expected Squared Difference (ESD), a tuning-free (i.e., hyperparameter-free) trainable calibration objective, where we view the calibration error from the perspective of the squared difference between two expectations. With extensive experiments on several architectures (CNNs, Transformers) and datasets, we demonstrate that (1) incorporating ESD into training improves model calibration across various batch sizes without the need for internal hyperparameter tuning, (2) ESD yields the best-calibrated results compared with previous approaches, and (3) ESD drastically reduces the computational cost of calibration during training due to the absence of internal hyperparameters. The code is publicly accessible at https://github.com/hee-suk-yoon/ESD.
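
Calibration quality in this paper is reported as ECE. As a point of reference, the following is a minimal sketch of the commonly used binned ECE estimator, written in plain Python. The function name, the equal-width binning, and the `n_bins` default are assumptions chosen for illustration; they are not taken from the authors' code. Note that this evaluation metric does depend on a bin count, in contrast to the binning-free ESD training objective.

```python
def expected_calibration_error(confidences, correct, n_bins=15):
    """Binned ECE: sample-weighted mean of |accuracy - confidence| over bins.

    confidences: predicted top-class probabilities in [0, 1]
    correct:     1 if the prediction was right, else 0
    n_bins:      number of equal-width bins (an illustrative assumption)
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Map the confidence to a bin index; conf == 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))

    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        avg_acc = sum(h for _, h in members) / len(members)
        ece += (len(members) / n) * abs(avg_acc - avg_conf)
    return ece


# Hypothetical toy data: an over-confident model (high confidence, 50% accuracy).
confs = [0.9, 0.9, 0.9, 0.9, 0.6, 0.6]
hits = [1, 0, 1, 0, 1, 0]
ece_value = expected_calibration_error(confs, hits, n_bins=10)
```

On the toy data the confidences exceed the 50% accuracy in both occupied bins, so the returned value is strictly positive; a model whose confidence matches its accuracy in every bin would score zero.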

Paper Structure (27 sections, 32 equations, 6 figures, 3 tables, 1 algorithm)

Figures (6)

  • Figure 1: Accuracy (%) curve (left) and its corresponding ECE (%) curve (right) during training with negative log-likelihood (NLL) loss. Because NLL implicitly trains for calibration error, the ECE of the training set approaches zero while the ECE of the test set increases during training.
  • Figure 2: ECE performance curves of MMCE (left) and SB-ECE (right) with respect to their varying internal hyperparameters on the MNIST, CIFAR10, and SNLI datasets.
  • Figure 3: Computational cost of single-run training (left) and total cost including hyperparameter tuning (right). The x-axis in both panels is ordered by increasing model complexity.
  • Figure 4: Visual intuition plot showing the cumulative confidence and cumulative accuracy with varying quantile scores of prediction confidence ($\alpha$) for an uncalibrated (left) and calibrated (right) network. The uncalibrated network was obtained by training ResNet34 on CIFAR100 with NLL, and the calibrated network was obtained by applying temperature scaling to that trained network.
  • Figure 5: Accuracy plot with respect to varying values of $\lambda$ across different datasets and models trained with MMCE, SB-ECE, and ESD. The threshold accuracy is the value 1.5% below the baseline accuracy, which was used as the model selection criterion as stated in the experimental setup section.
  • ...and 1 more figures
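
The intuition behind Figure 4 (cumulative confidence versus cumulative accuracy at a threshold $\alpha$) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names are hypothetical, the threshold is applied directly to the confidence value rather than to a quantile score, and the naive plug-in average of squared gaps shown here is generally biased. It is not the unbiased, consistent estimator the paper proposes, whose exact form is not given in this summary.

```python
def cumulative_gap(confidences, correct, alpha):
    """Cumulative accuracy minus cumulative confidence at threshold alpha.

    Averages (correct - confidence) over all samples, counting only those
    whose confidence is at most alpha. For a calibrated model this gap
    should be close to zero at every threshold.
    """
    n = len(confidences)
    total = sum(h - c for c, h in zip(confidences, correct) if c <= alpha)
    return total / n


def squared_gap_plugin(confidences, correct):
    """Naive plug-in average of squared cumulative gaps.

    Assumption: thresholds are taken at each sample's own confidence.
    Illustrative only; this is not the paper's unbiased estimator.
    """
    n = len(confidences)
    return sum(cumulative_gap(confidences, correct, a) ** 2
               for a in confidences) / n


# Hypothetical toy data: an over-confident model versus a calibrated one.
overconfident = squared_gap_plugin([0.9, 0.9, 0.9, 0.9, 0.6, 0.6],
                                   [1, 0, 1, 0, 1, 0])
calibrated = squared_gap_plugin([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0])
```

Because the quantity is built from thresholded averages rather than bins, it needs no bin count or kernel bandwidth, which matches the binning-free, tuning-free framing in the TL;DR.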