Optimizing Calibration by Gaining Aware of Prediction Correctness

Yuchi Liu; Lei Wang; Yuli Zou; James Zou; Liang Zheng

TL;DR

Calibration aims to align predictive confidence with accuracy. The paper introduces a Correctness-Aware (CA) loss that encodes this goal directly: it pushes confidence up on correct predictions and down on incorrect ones, using transformed versions of each input as evidence of correctness. A post-hoc calibrator trained with the CA loss and fed these transformed inputs learns a temperature that rescales the logits. It achieves competitive performance on both in-distribution (IND) and out-of-distribution (OOD) datasets and improves the separability of correct and wrong predictions. The analysis also shows where the CE and MSE losses fall short on narrowly wrong predictions, and the transformation-based strategy is practical to apply.

Abstract

Model calibration aims to align confidence with prediction correctness. The Cross-Entropy (CE) loss is widely used for calibrator training; it forces the model to increase confidence on the ground truth class. However, we find the CE loss has intrinsic limitations. For example, for a narrow misclassification (e.g., a test sample is wrongly classified and its softmax score on the ground truth class is 0.4), a calibrator trained by the CE loss often produces high confidence on the wrongly predicted class, which is undesirable. In this paper, we propose a new post-hoc calibration objective derived from the aim of calibration. Intuitively, the proposed objective function asks that the calibrator decrease model confidence on wrongly predicted samples and increase confidence on correctly predicted samples. Because a sample by itself gives insufficient indication of correctness, we use its transformed versions (e.g., rotated, greyscaled, and color-jittered) during calibrator training. Trained on an in-distribution validation set and tested with isolated, individual test samples, our method achieves competitive calibration performance on both in-distribution and out-of-distribution test sets compared with the state of the art. Further, our analysis points out the difference between our method and commonly used objectives such as the CE loss and the Mean Square Error (MSE) loss, where the latter two sometimes deviate from the calibration aim.
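The narrow-misclassification case described above can be checked numerically. The sketch below is illustrative only. The four logits are an assumed example (a ground-truth logit of 1.9 against a predicted-class logit of 2.0, which gives a ground-truth softmax score near 0.4), and `correctness_penalty` is a simplified stand-in for a correctness-aware objective (the squared gap between confidence and a 0/1 correctness label), not necessarily the paper's exact CA loss.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, gt_index, temperature):
    """CE loss on the ground-truth class after temperature scaling."""
    return -math.log(softmax(logits, temperature)[gt_index])

def correctness_penalty(logits, gt_index, temperature):
    """Assumed stand-in objective: (confidence - correctness)^2."""
    probs = softmax(logits, temperature)
    confidence = max(probs)
    correct = 1.0 if probs.index(confidence) == gt_index else 0.0
    return (confidence - correct) ** 2

# Assumed example: class 0 is the ground truth, class 1 is (wrongly) predicted.
LOGITS = [1.9, 2.0, 0.1, 0.05]
GT = 0

for t in (0.5, 1.0, 2.0):
    probs = softmax(LOGITS, t)
    print(f"T={t}: p_gt={probs[GT]:.3f} conf={max(probs):.3f} "
          f"CE={cross_entropy(LOGITS, GT, t):.3f} "
          f"penalty={correctness_penalty(LOGITS, GT, t):.3f}")
```

For these assumed logits, lowering the temperature from 1.0 to 0.5 reduces the CE loss while raising the confidence on the wrong class. The stand-in penalty, by contrast, decreases only as the temperature rises and the confidence on the wrong class drops.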

Paper Structure (22 sections, 16 equations, 10 figures, 4 tables, 1 algorithm)

Introduction
Related Work
Approach
Defining calibration error from its goal
Correctness-aware loss
Gaining correctness awareness
Comparison between CA loss and MLE
Experiments
Models and datasets
Calibration methods and evaluation metrics
Main observations
Further analysis
Conclusion
Our Algorithm
Comparison between CA loss and MLE
...and 7 more sections

Figures (10)

Figure 1: A failure example for calibrators trained by the Cross-Entropy (CE) or Mean Square Error (MSE) loss. A classifier makes a wrong prediction on a cat image. Before calibration, the classifier gives probabilities of 0.412 and 0.455 on the ground-truth and predicted classes, respectively. The calibrator trained with the CE loss assigns an even higher confidence of 0.538 to the wrong class, making things worse, and the one trained with the MSE loss keeps a similar confidence of 0.456. In comparison, the calibrator trained with the proposed Correctness-Aware (CA) loss decreases the confidence of this wrong prediction to 0.292, improving calibration.
Figure 2: Calibration pipeline. For a given image sample $\mathbf{X}$, we first obtain its logit vector $\mathbf{z}$ and softmax vector $\mathbf{v}$. We then apply $M$ different transformations (e.g., rotation, greyscale, colorjitter) to $\mathbf{X}$ to get its transformed versions and their softmax vectors $\mathbf{v}_i$ ($i \in \mathcal{I}_{M}$). Indices $\mathbf{q}\in\mathbb{R}^{k}$ of the top-$k$ largest probabilities (softmax scores) in $\mathbf{v}$ are used to select the corresponding $k$ scores from each $\mathbf{v}_i$, forming the concatenated input $\oplus_{i\in\mathcal{I}_{M}}\mathbf{v}_i[\mathbf{q}]$ to the calibrator. The calibrator outputs a temperature $\tau$, which is then used to rescale the logit vector $\mathbf{z}$ and produce the calibrated softmax vector. The calibrator is trained with our proposed Correctness-Aware (CA) loss (see the section "Correctness-aware loss").
Figure 3: Comparison of different loss functions w.r.t. temperature and softmax probability of the ground truth (GT) class. In a four-way classification task, we examine a wrongly predicted sample with logit vector $[a, 2.0, 0.1, 0.05]$, where $a\!<\!2$ is the value on the ground truth class. We use $c_\text{gt}$ to denote the softmax score of the GT class. Top: Loss surface plots for varying temperatures and $c_\text{gt}$, with red and blue arrows representing positive and negative temperature gradients, respectively. Bottom: 2D loss curves for varying $c_\text{gt}$. The lines in the bottom charts correspond to the lines of the same color in the top charts. Compared with Maximum Likelihood Estimation (MLE) based functions (e.g., Cross-Entropy, Mean Squared Error), minimizing our Correctness-Aware loss does not favor temperatures below 1 for incorrect predictions, whereas minimizing MLE-based losses sometimes does.
Figure 4: (Left:) Comparing various combinations of image transformations, including rotation (R), grayscale (S), colorjitter (C), random erasing (E) and Gaussian noise (G). Different colors mean different numbers of transformations. Dashed lines denote the performance of no calibration and of retrieval-based augmentation that accesses test batches. (Right:) ROC curves of various calibrators. Existing methods typically do not improve AUC, while our method does. All results in this figure are reported for ObjectNet using the model 'beit_base_patch16_384', as introduced in the appendix.
Figure 5: Impact of narrowly wrong and absolutely wrong predictions on calibrator performance. (Left:) We craft test sets containing 500 wrongly predicted samples with various degrees of being wrong. For example, the leftmost test set contains narrowly wrong samples, while the rightmost one contains absolutely wrong samples. The calibrator is trained on ImageNet-Val. (Right:) We craft training sets containing 1,000 wrong predictions and 1,000 correct predictions. The wrongly predicted samples also have different degrees of being wrong. We use ImageNet-A as the test set. For both subfigures, we use 'beit_base' as the classifier and compare CA with CE and no calibration. The advantage of our method is larger when the training/test sets contain more narrowly wrong predictions.
...and 5 more figures
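
The data flow in the Figure 2 caption (top-$k$ index selection, gathering the same indices from each transformed view, concatenation, and temperature rescaling) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_calibrator` is a hypothetical untrained linear map followed by a softplus to keep the temperature positive, and the logits, weights, and bias are made-up values.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_indices(v, k):
    """Indices q of the k largest entries of v, highest score first."""
    return sorted(range(len(v)), key=lambda j: v[j], reverse=True)[:k]

def calibrator_input(v, transformed_vs, k):
    """Concatenate v_i[q] over all transformed views, with q taken from v."""
    q = top_k_indices(v, k)
    features = []
    for v_i in transformed_vs:
        features.extend(v_i[j] for j in q)
    return features

def toy_calibrator(features, weights, bias):
    """Hypothetical calibrator: linear map + softplus, so that tau > 0."""
    s = sum(w * f for w, f in zip(weights, features)) + bias
    return math.log1p(math.exp(s))

def calibrate(z, transformed_logits, k, weights, bias):
    """Return the calibrated softmax vector and the temperature tau."""
    v = softmax(z)
    transformed_vs = [softmax(z_i) for z_i in transformed_logits]
    features = calibrator_input(v, transformed_vs, k)
    tau = toy_calibrator(features, weights, bias)
    return softmax(z, tau), tau

# Made-up inputs: logits of the original image and of M = 2 transformed views.
z = [1.9, 2.0, 0.1, 0.05]
views = [[1.0, 1.2, 0.3, 0.2], [2.1, 1.5, 0.0, 0.1]]
k = 2
weights = [0.0] * (k * len(views))  # untrained placeholder parameters
bias = 1.0

calibrated, tau = calibrate(z, views, k, weights, bias)
print("tau:", round(tau, 3), "calibrated:", [round(p, 3) for p in calibrated])
```

Taking the indices from the original softmax vector, and not from each view separately, keeps the calibrator input aligned across views: every block of $k$ features refers to the same candidate classes.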
