Revisiting Confidence Estimation: Towards Reliable Failure Prediction

Fei Zhu; Xu-Yao Zhang; Zhen Cheng; Cheng-Lin Liu

TL;DR

This work questions the assumption that confidence calibration and out-of-distribution (OOD) detection naturally improve failure prediction in neural classifiers. By showing that many calibration and OOD detection methods impair the separation between correct and misclassified predictions, it reframes failure prediction as a Bayes-like decision problem and links it to the flatness of the loss landscape. The authors introduce FMFP, a simple, plug-and-play approach that combines stochastic weight averaging (SWA) and sharpness-aware minimization (SAM) to find flat minima. The approach is supported by PAC-Bayes theory and shows strong empirical gains across balanced, long-tailed, and covariate-shift regimes, as well as improved OOD detection. The results establish a unified baseline for reliable confidence estimation and clarify the connections between calibration, OOD detection, and failure prediction, with practical implications for safety-critical AI.
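To make the SWA-plus-SAM idea concrete, here is a minimal NumPy sketch on a toy linear regression problem. It is not the authors' FMFP implementation: the function names, the toy loss, and the hyperparameters (`lr`, `rho`, `swa_start`) are all illustrative assumptions. It shows only the two generic ingredients: a SAM-style update that takes the gradient at an adversarially perturbed weight vector, and an SWA-style running average of the later iterates.

```python
import numpy as np


def loss_and_grad(w, X, y):
    """Mean squared error of a linear model and its gradient (toy loss, an assumption)."""
    residual = X @ w - y
    loss = float(np.mean(residual ** 2))
    grad = 2.0 * X.T @ residual / len(y)
    return loss, grad


def sam_swa_train(X, y, steps=200, lr=0.05, rho=0.05, swa_start=100):
    """Illustrative SAM updates followed by SWA averaging (hypothetical hyperparameters)."""
    w = np.zeros(X.shape[1])
    w_swa = np.zeros_like(w)
    n_avg = 0
    for step in range(steps):
        # SAM, step 1: move to an approximate worst-case point inside an
        # L2 ball of radius rho around the current weights.
        _, g = loss_and_grad(w, X, y)
        eps = rho * g / (np.linalg.norm(g) + 1e-12)
        # SAM, step 2: descend from the original weights using the gradient
        # evaluated at the perturbed weights.
        _, g_perturbed = loss_and_grad(w + eps, X, y)
        w = w - lr * g_perturbed
        # SWA: keep a running mean of the iterates once averaging starts.
        if step >= swa_start:
            n_avg += 1
            w_swa += (w - w_swa) / n_avg
    return w, w_swa


rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true
w_last, w_avg = sam_swa_train(X, y)
print("distance of averaged weights from w_true:", np.linalg.norm(w_avg - w_true))
```

On a convex toy problem like this one the benefit of flatness is not visible; the sketch is meant only to show where the perturbation and the averaging enter the training loop.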

Abstract

Reliable confidence estimation is a challenging yet fundamental requirement in many risk-sensitive applications. However, modern deep neural networks are often overconfident in their incorrect predictions, i.e., on misclassified samples from known classes and on out-of-distribution (OOD) samples from unknown classes. In recent years, many confidence calibration and OOD detection methods have been developed. In this paper, we identify a widespread but largely neglected phenomenon: most confidence estimation methods are harmful for detecting misclassification errors. We investigate this problem and reveal that popular calibration and OOD detection methods often lead to worse confidence separation between correctly classified and misclassified examples, making it difficult to decide whether to trust a prediction. Finally, we propose to enlarge the confidence gap by finding flat minima, which yields state-of-the-art failure prediction performance under various settings, including balanced, long-tailed, and covariate-shift classification scenarios. Our study not only provides a strong baseline for reliable confidence estimation but also acts as a bridge between calibration, OOD detection, and failure prediction. The code is available at https://github.com/Impression2805/FMFP.

Paper Structure (27 sections, 4 theorems, 25 equations, 17 figures, 9 tables, 1 algorithm)

Introduction
Problem Formulation and Background
Confidence Calibration
OOD Detection
Failure Prediction
Does Calibration and OOD Detection Help Failure Prediction?
Experimental Setup
Experimental Results
Further Discussion and Analysis
Discussion on Calibration for Failure Prediction
Discussion on OOD detection for Failure Prediction
Finding Flat Minima for Reliable Confidence Estimation
Motivation and Methodology
Motivation
Methodology
...and 12 more sections

Key Result

Proposition 1

(Calibration-Discrimination Decomposition). Let $C$ be the jointly calibrated scores, i.e., $C_k = \mathbb{P}(Y = e_k \mid S = s)$ for $k = 1, \dots, K$. The divergence of strictly proper scoring rules can be decomposed as [kull2015novel, dimitriadis2021stable, perez2022beyond]:
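As an illustration of what such a decomposition looks like in practice, the sketch below checks a well-known special case numerically: for the Brier score, the expected divergence between scores $S$ and one-hot labels $Y$ splits into a term between $S$ and the calibrated scores $C$ plus a term between $C$ and $Y$. Treating this as the form intended by Proposition 1 is an assumption, as are the function name `brier_decomposition` and the synthetic data. Here $C$ is estimated as the empirical label frequency among samples that share the same score vector, which makes the identity hold exactly on the sample.

```python
import numpy as np


def brier_decomposition(scores, labels, num_classes):
    """Return (total, calibration_term, second_term) for the Brier score.

    scores: (n, K) array of predicted probability vectors S.
    labels: (n,) integer class labels.
    The calibrated scores C are the empirical class frequencies within each
    group of identical score vectors (an illustrative estimator).
    """
    onehot = np.eye(num_classes)[labels]
    # Group samples that received exactly the same score vector.
    _, group = np.unique(scores, axis=0, return_inverse=True)
    group = np.asarray(group).ravel()
    calibrated = np.zeros_like(scores, dtype=float)
    for g in np.unique(group):
        mask = group == g
        calibrated[mask] = onehot[mask].mean(axis=0)
    total = np.mean(np.sum((scores - onehot) ** 2, axis=1))
    calibration_term = np.mean(np.sum((scores - calibrated) ** 2, axis=1))
    second_term = np.mean(np.sum((calibrated - onehot) ** 2, axis=1))
    return total, calibration_term, second_term


rng = np.random.default_rng(0)
# Three hypothetical score vectors that a 3-class model might output.
prototypes = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]])
idx = rng.integers(0, 3, size=1000)
scores = prototypes[idx]
labels = rng.integers(0, 3, size=1000)  # labels unrelated to scores, so S is miscalibrated
total, cal, second = brier_decomposition(scores, labels, num_classes=3)
print(total, cal + second)
```

The first term vanishes when the scores are already calibrated, while the second term depends only on how well the calibrated scores separate the classes, which is why calibration alone need not help distinguish correct from incorrect predictions.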

Figures (17)

Figure 1: A comparison of (a) AURC ($\downarrow$) and (b) AUROC ($\uparrow$). We observed that many popular confidence calibration and OOD detection methods are useless or harmful for failure prediction. We propose a simple flat minima based method that can yield state-of-the-art failure prediction performance. ResNet110 [HeZRS16] on CIFAR-10 [krizhevsky2009learning].
Figure 2: Calibration reduces the mismatch between confidence and accuracy, and OOD detection distinguishes OOD samples from InD samples. They share the same motivation to provide reliable confidence for trustworthy AI. In practice, both OOD and misclassified samples are failure sources and should be rejected together.
Figure 3: Comparison of risk-coverage curves. At a given coverage, lower risk is better. ResNet110 on CIFAR-10.
Figure 4: Large-scale experiments on ImageNet [deng2009imagenet].
Figure 5: Average confidence of correct samples during training and confidence distribution of correct and misclassified samples. ResNet110 on CIFAR-10.
...and 12 more figures
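
The two metrics named in Figure 1 and the risk-coverage curves of Figure 3 can be computed from per-sample confidences and correctness flags. The sketch below uses common definitions, which are assumed rather than taken from the paper: AURC as the mean selective risk over all coverage levels when samples are ranked by decreasing confidence, and AUROC as the probability that a correct prediction receives higher confidence than a misclassified one, with ties counted as one half. The function names are illustrative.

```python
import numpy as np


def aurc(confidence, correct):
    """Area under the risk-coverage curve (lower is better)."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    # Rank samples from most to least confident.
    order = np.argsort(-confidence, kind="stable")
    errors = 1.0 - correct[order].astype(float)
    # Selective risk when keeping only the top-k most confident samples.
    risks = np.cumsum(errors) / np.arange(1, len(errors) + 1)
    return float(risks.mean())


def auroc(confidence, correct):
    """AUROC for separating correct (positive) from misclassified (negative) samples."""
    confidence = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    pos = confidence[correct]
    neg = confidence[~correct]
    # Compare every correct sample with every misclassified one.
    greater = np.sum(pos[:, None] > neg[None, :])
    ties = np.sum(pos[:, None] == neg[None, :])
    return float((greater + 0.5 * ties) / (len(pos) * len(neg)))


# Hypothetical confidences (e.g. maximum softmax probability) and correctness flags.
conf = [0.9, 0.8, 0.7, 0.6]
is_correct = [1, 1, 0, 1]
print("AURC:", aurc(conf, is_correct))
print("AUROC:", auroc(conf, is_correct))
```

The pairwise AUROC computation is quadratic in the number of samples, which is fine for a small illustration; a rank-based formula is preferable for large test sets.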

Theorems & Definitions (8)

Definition 1
Remark 1
Definition 2
Definition 3
Proposition 1
Proposition 2
Proposition 3
Theorem 1
