Softmax is not Enough (for Adaptive Conformal Classification)

Navid Akhavan Attar; Hesam Asadollahzadeh; Ling Luo; Uwe Aickelin

TL;DR

This work leverages information from the pre-softmax logit space, using the Helmholtz free energy as a measure of model uncertainty and sample difficulty, to make conformal prediction sets more adaptive while also improving their efficiency.

Abstract

The merit of Conformal Prediction (CP), as a distribution-free framework for uncertainty quantification, depends on generating prediction sets that are both efficient, reflected in small average set sizes, and adaptive, meaning they signal uncertainty by varying in size according to input difficulty. A central limitation of deep conformal classifiers is that their nonconformity scores are derived from softmax outputs, which can be unreliable indicators of how certain the model truly is about a given input, sometimes leading to overconfident misclassifications or undue hesitation. In this work, we argue that this unreliability can be inherited by the prediction sets generated by CP, limiting their capacity for adaptiveness. We propose a new approach that leverages information from the pre-softmax logit space, using the Helmholtz free energy as a measure of model uncertainty and sample difficulty. By reweighting nonconformity scores with a monotonic transformation of each sample's energy score, we improve their sensitivity to input difficulty. Our experiments with four state-of-the-art score functions on multiple datasets and deep architectures show that this energy-based enhancement yields a notable increase in both the efficiency and the adaptiveness of the prediction sets compared to baseline nonconformity scores, without introducing any post-hoc complexity.
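
The reweighting idea described in the abstract can be sketched in a few lines of split conformal prediction. This is a minimal illustration under stated assumptions, not the authors' implementation: the base score (one minus the softmax probability), the softplus transformation, the temperature `T`, and all function names are hypothetical choices made here, since this summary does not specify them.

```python
import numpy as np

def free_energy(logits, T=1.0):
    # F(x) = -T * log sum_k exp(f_k(x) / T), computed with a stable log-sum-exp.
    z = logits / T
    m = z.max(axis=1, keepdims=True)
    return -T * (m[:, 0] + np.log(np.exp(z - m).sum(axis=1)))

def softmax(logits):
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def energy_weight(logits, T=1.0):
    # ASSUMPTION: softplus of the negative energy as the monotonic transformation.
    # It is positive and grows with -F(x), i.e. with model confidence.
    return np.logaddexp(0.0, -free_energy(logits, T))

def scores(logits, T=1.0):
    # ASSUMPTION: base nonconformity score is 1 - softmax probability.
    # Returns an (n, K) matrix: one reweighted score per candidate label.
    return energy_weight(logits, T)[:, None] * (1.0 - softmax(logits))

def calibrate(cal_logits, cal_labels, alpha=0.1):
    # Standard split-conformal quantile of the true-label scores.
    n = len(cal_labels)
    s = scores(cal_logits)[np.arange(n), cal_labels]
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return np.quantile(s, level, method="higher")

def predict_sets(test_logits, qhat):
    # Boolean (n, K) mask: label k is in the set when its score is <= qhat.
    return scores(test_logits) <= qhat
```

Because the weight depends only on the input and not on the candidate label, thresholding the reweighted score at `qhat` is the same as thresholding the base score at `qhat / weight`: confident inputs (large negative energy) face a tighter effective threshold, while uncertain inputs face a looser one and receive larger sets.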

Paper Structure (56 sections, 5 theorems, 51 equations, 10 figures, 19 tables, 1 algorithm)

Introduction
Motivation and Method
Softmax Unreliability and Implications for Conformal Prediction
Free Energy as a Measure of Epistemic Uncertainty
Energy-based Nonconformity Scores
Experiments
Balanced Training Data
Imbalanced Training Data
Reliability Under Distributional Shift
Desiderata for a Reliable Conformal Classifier on OOD Data
Conclusion
Reproducibility Statement
Notation
Related Works
Improving Prediction Set Efficiency.
...and 41 more sections

Key Result

Proposition 2.1

The Helmholtz free energy $F(x)$ is a valid measure of epistemic uncertainty, as it is linearly proportional to the negative log-likelihood of the model-implied data density $p(x)$.
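
For context, the relationship in the proposition can be written out under the standard energy-based reading of a classifier. The notation here is an assumption, since this summary does not reproduce the paper's equations: $f_k(x)$ denotes the logit for class $k$, $K$ the number of classes, $T$ a temperature, and $Z$ the normalizing constant of the model-implied density.

$$F(x) = -T \log \sum_{k=1}^{K} \exp\big(f_k(x)/T\big), \qquad p(x) = \frac{\exp\big(-F(x)/T\big)}{Z}, \qquad Z = \int \exp\big(-F(x')/T\big)\,dx'$$

$$-\log p(x) = \frac{1}{T}\,F(x) + \log Z$$

Since $Z$ does not depend on $x$, the negative log-likelihood is a linear function of $F(x)$ up to an additive constant, so inputs with lower free energy are those the model assigns higher density.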

Figures (10)

Figure 1: Prediction sets from a standard method (APS; Romano et al., 2020) versus our energy-based variant, demonstrating improved adaptiveness on ImageNet. (i) For an easy input like the image of a Macaw, whose clear visual cues (vivid colors, long tail) make it simple to classify, our energy-based method produces a smaller, more efficient set. (ii) For a hard input, a bird image labeled as Hummingbird whose appearance deviates from typical hummingbirds (e.g., a thicker, less tapered beak) and shares features with other bird classes, the energy-based method returns a larger prediction set, signaling higher uncertainty. (iii) Finally, for an out-of-distribution (OOD) input like a brain MRI that the model wasn't trained on, our method generates a much larger set, warning the user that the prediction is unreliable. This improvement in adaptive behavior is guided by the Helmholtz free energy, which captures the model's uncertainty about an input.
Figure 2: (a) Softmax probability distributions and (b) raw logit outputs of two CIFAR-100 samples computed by a trained ResNet-56. Both samples receive similarly high softmax confidence scores, despite differing significantly in difficulty (1 vs. 27). In contrast, their negative energy scores more clearly reflect this difference.
Figure 3: Distribution of negative energy scores ($-F(x)$), stratified by sample difficulty. As difficulty increases, the distribution shifts toward lower energy values, indicating reduced model confidence.
Figure 4: Distributions of negative energy scores across various class bins under balanced and imbalanced training. Results are for CIFAR-100. (a) Balanced model: scores are consistent across class bins. (b) Imbalanced model ($\lambda = 0.03$): minority classes exhibit lower negative energy scores, reflecting reduced confidence.
Figure 5: Prediction set size distributions for the SAPS score and its energy-based variant with $\alpha=0.05$, on (a) in-distribution CIFAR-100 and (b) out-of-distribution Places365 data. The energy-based variant produces larger prediction sets on OOD data. Here, $\mu$ represents the overall set size.
...and 5 more figures
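
The contrast drawn in the Figure 2 caption, similar softmax confidence but different energy, follows from a basic property: softmax is invariant to adding a constant to every logit, whereas the negative energy shifts by exactly that constant. The sketch below illustrates this with made-up logits; the values and function names are hypothetical and are not taken from the paper's CIFAR-100 samples.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def neg_energy(z, T=1.0):
    # -F(x) = T * log sum_k exp(z_k / T), computed with a stable log-sum-exp.
    m = (z / T).max()
    return T * (m + np.log(np.exp(z / T - m).sum()))

# Two hypothetical logit vectors that differ only by a constant shift.
a = np.array([2.0, 0.0, -1.0])
b = a + 5.0

same_softmax = np.allclose(softmax(a), softmax(b))  # softmax cannot tell them apart
energy_gap = neg_energy(b) - neg_energy(a)          # the shift survives in -F(x)
print(same_softmax, energy_gap)
```

The overall logit magnitude is therefore information that is available before the softmax and discarded by it, which is the kind of signal an energy score can retain.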

Theorems & Definitions (12)

Proposition 2.1
Theorem 2.2: Monotonicity of Expected Confidence with Sample Difficulty
Proposition 2.3: Equivalence to Sample-Dependent Thresholding
Theorem 3.1: Free Energy as an Indicator of Class Imbalance
Remark F.1
Theorem G.1: Exchangeability of Energy-Based Nonconformity Scores
proof
proof
proof
proof
...and 2 more
