Optimizing Calibration by Gaining Aware of Prediction Correctness
Yuchi Liu, Lei Wang, Yuli Zou, James Zou, Liang Zheng
TL;DR
Calibration aims to align predictive confidence with accuracy. The paper introduces a Correctness-Aware (CA) loss that encodes the calibration goal directly: it assigns high confidence to correct predictions and low confidence to incorrect ones, using transformed versions of each input to help infer correctness. A post-hoc calibrator trained with the CA loss and fed these transformed inputs learns a temperature that rescales the logits. The method achieves competitive calibration performance on both in-distribution (IND) and out-of-distribution (OOD) datasets and improves the separability of correct and incorrect predictions. The analysis also highlights limitations of the Cross-Entropy (CE) and Mean Squared Error (MSE) losses on narrowly misclassified samples and offers a practical, transform-based alternative.
Abstract
Model calibration aims to align confidence with prediction correctness. The Cross-Entropy (CE) loss is widely used for calibrator training; it pushes the model to increase confidence on the ground-truth class. However, we find that the CE loss has intrinsic limitations. For example, for a narrow misclassification (e.g., a test sample is wrongly classified and its softmax score on the ground-truth class is 0.4), a calibrator trained with the CE loss often produces high confidence on the wrongly predicted class, which is undesirable. In this paper, we propose a new post-hoc calibration objective derived from the aim of calibration. Intuitively, the proposed objective asks the calibrator to decrease model confidence on wrongly predicted samples and increase confidence on correctly predicted samples. Because a sample by itself is insufficient to indicate correctness, we use its transformed versions (e.g., rotated, greyscaled, and color-jittered) during calibrator training. Trained on an in-distribution validation set and tested on isolated, individual test samples, our method achieves competitive calibration performance on both in-distribution and out-of-distribution test sets compared with the state of the art. Further, our analysis points out the difference between our method and commonly used objectives such as the CE loss and the Mean Squared Error (MSE) loss, where the latter two sometimes deviate from the calibration aim.
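The narrow-misclassification example can be made concrete with a small numerical sketch. The code below is an illustration under stated assumptions, not the paper's implementation: the three-class probabilities (0.45, 0.40, 0.15), the grid search over a single scalar temperature, and the squared-gap objective `correctness_loss` (squared difference between maximum softmax confidence and a 0/1 correctness indicator) are all hypothetical stand-ins chosen to match the intuition described in the abstract. The paper's calibrator instead predicts a temperature from transformed inputs, and its exact loss may differ.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def ce_loss(logits, label, temperature):
    """Cross-entropy on the ground-truth class after temperature scaling."""
    return -math.log(softmax(logits, temperature)[label])

def correctness_loss(logits, label, temperature):
    """Assumed stand-in for a correctness-aware objective:
    squared gap between max confidence and the 0/1 correctness indicator."""
    probs = softmax(logits, temperature)
    confidence = max(probs)
    predicted = probs.index(confidence)
    correct = 1.0 if predicted == label else 0.0
    return (confidence - correct) ** 2

# Hypothetical narrow misclassification: class 0 is predicted with 0.45,
# while the ground-truth class 1 receives 0.40.
probs = [0.45, 0.40, 0.15]
logits = [math.log(p) for p in probs]
label = 1

# Grid of candidate temperatures from 0.5 to 5.0 (illustrative range).
grid = [0.5 + 0.05 * i for i in range(91)]

def best_temperature(loss_fn):
    """Temperature on the grid that minimizes loss_fn for this one sample."""
    return min(grid, key=lambda t: loss_fn(logits, label, t))

t_ce = best_temperature(ce_loss)
t_ca = best_temperature(correctness_loss)

print(f"CE-optimal T = {t_ce:.2f}, "
      f"confidence on wrong class = {max(softmax(logits, t_ce)):.3f}")
print(f"Correctness-aware T = {t_ca:.2f}, "
      f"confidence on wrong class = {max(softmax(logits, t_ca)):.3f}")
```

In this toy setting the CE-optimal temperature falls below 1, which sharpens the distribution and raises confidence on the wrongly predicted class, whereas the squared-gap objective prefers a temperature above 1 and lowers that confidence. On a single wrong sample the squared-gap objective simply pushes the temperature to the top of the grid; in practice such an objective would be averaged over many correct and incorrect samples.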
