Does confidence calibration improve conformal prediction?

Huajun Xi, Jianguo Huang, Kangdao Liu, Lei Feng, Hongxin Wei

TL;DR

This work questions the conventional use of confidence calibration to improve conformal prediction. It shows empirically that post-hoc calibration often enlarges adaptive conformal prediction sets, while inducing high-confidence predictions via smaller temperatures can improve efficiency, though extremely low temperatures run into numerical issues. The authors provide a theoretical link between temperature and non-conformity scores, and introduce ConfTS, a loss-driven variant of temperature scaling that optimizes for prediction-set efficiency and generalizes to other post-hoc calibrators. Across image and text tasks, including large language models, ConfTS yields substantial efficiency gains without compromising marginal coverage, highlighting a practical path to more trustworthy uncertainty quantification.
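
A minimal sketch of the pipeline the TL;DR refers to, assuming split conformal prediction with the non-randomized APS score and synthetic logits. All function names and data here are hypothetical, not the authors' code. Temperature only rescales the logits before the softmax, so the same calibration code can be rerun at several values of t to compare average set size at a fixed miscoverage level alpha. It uses NumPy; the `method="higher"` argument of `np.quantile` requires NumPy 1.22 or later.

```python
import numpy as np


def softmax(logits: np.ndarray, t: float) -> np.ndarray:
    """Temperature-scaled softmax along the last axis (numerically stable)."""
    z = logits / t
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def aps_scores(probs: np.ndarray) -> np.ndarray:
    """Non-randomized APS score of every class: cumulative probability mass
    of all classes ranked at or above it (sorted by descending probability)."""
    order = np.argsort(-probs, axis=1)
    cum = np.cumsum(np.take_along_axis(probs, order, axis=1), axis=1)
    scores = np.empty_like(probs)
    np.put_along_axis(scores, order, cum, axis=1)  # scatter back to class indices
    return scores


def split_conformal(cal_logits, cal_labels, test_logits, t, alpha=0.1):
    """Calibrate an APS threshold at temperature t; return a boolean set mask."""
    n = len(cal_labels)
    cal_scores = aps_scores(softmax(cal_logits, t))[np.arange(n), cal_labels]
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)  # finite-sample correction
    qhat = np.quantile(cal_scores, level, method="higher")
    return aps_scores(softmax(test_logits, t)) <= qhat


# Hypothetical synthetic data: labels drawn from softmax(logits), 10 classes.
rng = np.random.default_rng(0)
n_classes = 10


def make_split(n):
    logits = 2.0 * rng.normal(size=(n, n_classes))
    cdf = np.cumsum(softmax(logits, 1.0), axis=1)
    # Inverse-CDF sampling; the clip guards against float round-off at cdf[-1].
    labels = np.minimum((cdf < rng.random((n, 1))).sum(axis=1), n_classes - 1)
    return logits, labels


cal_logits, cal_labels = make_split(2000)
test_logits, test_labels = make_split(2000)

results = {}
for t in (0.5, 1.0, 2.0):
    sets = split_conformal(cal_logits, cal_labels, test_logits, t)
    coverage = sets[np.arange(len(test_labels)), test_labels].mean()
    results[t] = (coverage, sets.sum(axis=1).mean())
    print(f"t={t}: coverage={coverage:.3f}, avg set size={results[t][1]:.2f}")
```

Coverage should stay near or above 1 - alpha = 0.9 at every temperature, because the marginal guarantee holds for any score function fixed before calibration. What changes with t is the set size. This toy setup is not meant to reproduce the paper's ImageNet results.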

Abstract

Conformal prediction is an emerging technique for uncertainty quantification that constructs prediction sets guaranteed to contain the true label with a predefined probability. Previous works often employ temperature scaling to calibrate classifiers, assuming that confidence calibration benefits conformal prediction. However, the specific impact of confidence calibration on conformal prediction remains underexplored. In this work, we make two key discoveries about the impact of confidence calibration methods on adaptive conformal prediction. First, we empirically show that current confidence calibration methods (e.g., temperature scaling) typically lead to larger prediction sets in adaptive conformal prediction. Second, by investigating the role of the temperature value, we observe that high-confidence predictions can enhance the efficiency of adaptive conformal prediction. Theoretically, we prove that predictions with higher confidence result in smaller prediction sets in expectation. This finding implies that the rescaling parameters in these calibration methods, when optimized with cross-entropy loss, might counteract the goal of generating efficient prediction sets. To address this issue, we propose Conformal Temperature Scaling (ConfTS), a variant of temperature scaling with a novel loss function designed to enhance the efficiency of prediction sets. This approach can be extended to optimize the parameters of other post-hoc confidence calibration methods. Extensive experiments demonstrate that our method improves existing adaptive conformal prediction methods in classification tasks, especially with LLMs.
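
The abstract's concern is that calibration parameters are usually tuned with cross-entropy loss. For reference, here is a minimal sketch of that standard procedure: temperature scaling fit by minimizing validation negative log-likelihood, done by grid search rather than gradient descent, on hypothetical synthetic data. This is the conventional calibrator the paper questions, not ConfTS; the ConfTS loss is not reproduced in this summary, and the function names are illustrative.

```python
import numpy as np


def log_softmax(logits: np.ndarray, t: float) -> np.ndarray:
    """Row-wise log-softmax of logits / t (numerically stable)."""
    z = logits / t
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))


def nll(logits: np.ndarray, labels: np.ndarray, t: float) -> float:
    """Mean cross-entropy of the temperature-scaled predictions."""
    return float(-log_softmax(logits, t)[np.arange(len(labels)), labels].mean())


def fit_temperature(logits, labels, grid=np.linspace(0.05, 5.0, 496)):
    """Cross-entropy temperature scaling: the grid value of t with lowest NLL."""
    losses = [nll(logits, labels, t) for t in grid]
    return float(grid[int(np.argmin(losses))])


# Hypothetical overconfident model: labels actually follow softmax(logits / 2),
# so the NLL-optimal temperature should come out close to 2.
rng = np.random.default_rng(0)
n, n_classes = 5000, 10
logits = 3.0 * rng.normal(size=(n, n_classes))
cdf = np.cumsum(np.exp(log_softmax(logits, 2.0)), axis=1)
labels = np.minimum((cdf < rng.random((n, 1))).sum(axis=1), n_classes - 1)

t_ce = fit_temperature(logits, labels)
print(f"cross-entropy temperature: {t_ce:.2f}")
```

The fitted `t_ce` can then be plugged into a split-conformal APS pipeline and its average set size compared against smaller temperatures, which is the kind of comparison the abstract describes.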

Paper Structure (36 sections, 10 theorems, 59 equations, 4 figures, 12 tables)

Key Result

Proposition 3.1

For an instance $\boldsymbol{x}\in\mathcal{X}$, let $\mathcal{S}(\boldsymbol{x},k,t)$ be the non-conformity score function of an arbitrary class $k\in\mathcal{Y}$ at temperature $t$, as defined in Eq. eq:score_t. Then, for a fixed temperature $t_0\in\mathbb{R}^{+}$ and all $t\in(0,t_0)$, we have
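
Eq. eq:score_t is not reproduced in this summary. As an assumption, the sketch below uses the standard randomized APS score to show how a score of the form $\mathcal{S}(\boldsymbol{x},k,t)$ depends on the temperature: the probability mass of classes strictly more likely than $k$, plus $u \cdot p_k$ with $u\sim\mathcal{U}(0,1)$, where $p$ is the softmax of the logits divided by $t$. The function name and example logits are hypothetical, and the code makes no claim about the direction of the inequality in Proposition 3.1.

```python
import numpy as np


def aps_score(logits: np.ndarray, k: int, t: float, u: float) -> float:
    """Randomized APS non-conformity score of class k at temperature t.

    Assumed form: probability mass of the classes strictly more likely than k,
    plus u * p_k, where p = softmax(logits / t) and u ~ Uniform(0, 1).
    """
    z = logits / t
    p = np.exp(z - z.max())
    p = p / p.sum()
    return float(p[p > p[k]].sum() + u * p[k])


logits = np.array([2.0, 1.0, 0.5, 0.0])
for t in (0.25, 0.5, 1.0, 2.0):
    # The score is linear in u, so its expectation over u ~ U(0, 1) equals the value at u = 0.5.
    print(f"t={t}: E_u[S(x, k=1, t)] = {aps_score(logits, 1, t, 0.5):.4f}")
```

Sweeping `t` this way makes the temperature dependence of the score concrete for a single instance; the proposition's actual statement should be taken from the paper.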

Figures (4)

  • Figure 1: (a) & (b): The performance of APS and RAPS with different temperatures on ImageNet. The results show that high-confidence predictions, with a small temperature, lead to efficient prediction sets. (c): The performance of APS for ResNet18 on ImageNet with extremely low temperatures. In this setting, APS generates large prediction sets with conservative coverage due to finite precision.
  • Figure 2: Performance comparison of prediction sets under different temperatures.
  • Figure 3: An example of softmax probabilities produced by a small temperature.
  • Figure 4: (a) & (b): Average APS prediction-set sizes on ResNet18 and ResNet50, respectively, for examples of different difficulty. Results show that ConfTS maintains adaptiveness. (c) & (d): Average APS set sizes with ConfTS under varying sizes of (c) the conformal dataset and (d) the validation dataset. Results show that ConfTS is robust to variations in the sizes of the validation and conformal datasets.
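
Figure 1(c) attributes the large, over-covering APS sets at extremely low temperatures to finite precision. The float32 sketch below illustrates one plausible mechanism; it is our illustration under that assumption, not the paper's analysis, and the helper names are hypothetical. Once t is tiny, every non-top probability underflows to exactly zero, so the cumulative APS scores of all classes tie at 1.0 and a calibrated threshold at 1.0 can no longer separate them.

```python
import numpy as np


def softmax32(logits, t):
    """Temperature-scaled softmax computed in float32, as in typical GPU inference."""
    z = np.asarray(logits, dtype=np.float32) / np.float32(t)
    z = z - z.max()
    e = np.exp(z)  # large negative exponents underflow to exactly 0.0
    return e / e.sum()


def aps_cumulative(p):
    """Non-randomized APS score per class: cumulative mass down to that class's rank."""
    order = np.argsort(-p, kind="stable")
    scores = np.empty_like(p)
    scores[order] = np.cumsum(p[order])
    return scores


logits = [2.0, 1.0, 0.5, 0.0]
for t in (1.0, 0.1, 1e-3):
    p = softmax32(logits, t)
    print(f"t={t:g}  probs={p}  aps={aps_cumulative(p)}")
```

At `t=1e-3` the scaled logit gaps are on the order of 1000, far beyond what `exp` can represent in float32, so the probability vector collapses to a one-hot vector and every class receives the same score.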

Theorems & Definitions (18)

  • Proposition 3.1
  • Corollary 3.2
  • Theorem 3.3
  • Definition 3.4: Efficiency Gap
  • Lemma G.1
  • proof
  • Lemma G.2
  • proof
  • Lemma G.3
  • proof
  • ...and 8 more