On Calibration of Large Language Models: From Response To Capability

Sin-Han Yang; Cheng-Kuang Wu; Chieh-Yen Lin; Yun-Nung Chen; Hung-yi Lee; Shao-Hua Sun

TL;DR

This work formally distinguishes capability calibration from response calibration, shows that the two differ both theoretically and empirically, and demonstrates that capability-calibrated confidence improves pass@$k$ prediction and inference budget allocation.

Abstract

Large language models (LLMs) are widely deployed as general-purpose problem solvers, making accurate confidence estimation critical for reliable use. Prior work on LLM calibration largely focuses on response-level confidence, which estimates the correctness of a single generated output. However, this formulation is misaligned with many practical settings where the central question is how likely a model is to solve a query overall. We show that this mismatch results from the stochastic nature of modern LLM decoding, under which single-response correctness fails to reflect underlying model capability. To address this issue, we introduce capability calibration, which targets the model's expected accuracy on a query. We formally distinguish capability calibration from response calibration and show that the two differ both theoretically and empirically. We establish an empirical evaluation setup and study a range of confidence estimation methods. Our results demonstrate that capability-calibrated confidence improves pass@$k$ prediction and inference budget allocation, establishing a foundation with potential for diverse applications.
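To make the distinction in the abstract concrete, here is a minimal Python sketch that scores the same confidence values against the two targets. It is an illustration under stated assumptions, not the paper's evaluation code: the correctness flags and confidence values are made-up toy data, the capability target is approximated by the mean correctness over a few sampled responses, and the pass@$k$ formula assumes responses are sampled independently.

```python
def capability_target(correct_flags):
    """Monte Carlo estimate of expected accuracy: fraction of sampled responses that are correct."""
    return sum(correct_flags) / len(correct_flags)


def brier_score(confidences, targets):
    """Mean squared difference between confidences and calibration targets."""
    return sum((s - t) ** 2 for s, t in zip(confidences, targets)) / len(targets)


def predicted_pass_at_k(mu, k):
    """Probability that at least one of k samples is correct, assuming i.i.d. sampling."""
    return 1.0 - (1.0 - mu) ** k


# Hypothetical data: 3 queries, 4 sampled responses each (1 = correct, 0 = incorrect).
samples = {"q1": [1, 0, 1, 1], "q2": [0, 0, 0, 1], "q3": [1, 1, 1, 1]}
# Hypothetical per-query confidence estimates.
confidence = {"q1": 0.75, "q2": 0.25, "q3": 1.0}

queries = list(samples)
conf = [confidence[q] for q in queries]

# Capability calibration target: expected accuracy per query.
cc_targets = [capability_target(samples[q]) for q in queries]
# Response calibration target: correctness of a single sampled response (here, the first).
rc_targets = [samples[q][0] for q in queries]

cc_brier = brier_score(conf, cc_targets)
rc_brier = brier_score(conf, rc_targets)
pass_at_2 = [predicted_pass_at_k(mu, 2) for mu in cc_targets]
```

In this toy data the confidences match the per-query accuracy exactly, so the capability Brier score is zero, yet the same confidences still incur a nonzero response Brier score because a single binary outcome cannot equal a fractional accuracy.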

Paper Structure (50 sections, 2 theorems, 41 equations, 11 figures, 8 tables)

This paper contains 50 sections, 2 theorems, 41 equations, 11 figures, 8 tables.

Introduction
Related Works
LLM confidence estimation and calibration
LLM uncertainty quantification
Capability Calibration
Definition
Difference between response calibration and capability calibration
Measuring Capability Calibration
Evaluation framework
Methods for confidence estimation
Experiments
Setup
Results and Discussion
Applications
Pass@$k$ simulation
...and 35 more sections

Key Result

Theorem 1

(Divergence of targets and optima). Let $x$ be an input and $\hat{y} \sim f_\theta(\cdot \mid x)$ be a generated response. Minimizing the Brier scores for response calibration and capability calibration yields distinct optimal confidence estimators. Unless the model is deterministic, or its predictions are always correct or always incorrect, the e...
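The intuition behind this statement can be sketched as follows. This is an illustrative derivation, not the paper's proof: it assumes that correctness $\mathcal{C}(x,\hat{y}) \in \{0,1\}$ is a deterministic function of the query and the response, and that both Brier scores take the standard squared-error form. The paper's exact equations are not reproduced here.

```latex
\begin{align*}
\text{Response:}\quad
  & \min_{s}\ \mathbb{E}_{x}\,\mathbb{E}_{\hat{y} \sim f_\theta(\cdot \mid x)}
    \big[(s(x,\hat{y}) - \mathcal{C}(x,\hat{y}))^2\big]
  && \Rightarrow\quad s^{*}(x,\hat{y}) = \mathcal{C}(x,\hat{y}) \in \{0,1\}, \\
\text{Capability:}\quad
  & \min_{s}\ \mathbb{E}_{x}\big[(s(x,f_\theta) - \mu(x))^2\big],
    \quad \mu(x) = \mathbb{E}_{\hat{y} \sim f_\theta(\cdot \mid x)}[\mathcal{C}(x,\hat{y})]
  && \Rightarrow\quad s^{*}(x,f_\theta) = \mu(x) \in [0,1], \\
\text{Gap:}\quad
  & \mathbb{E}_{\hat{y} \sim f_\theta(\cdot \mid x)}
    \big[(\mathcal{C}(x,\hat{y}) - \mu(x))^2\big] = \mu(x)\,(1 - \mu(x)).
\end{align*}
```

Under these assumptions the gap is the variance of a Bernoulli variable with mean $\mu(x)$. It vanishes only when $\mu(x) \in \{0,1\}$, that is, when the model is always correct or always incorrect on $x$, or when decoding is deterministic so that a single response fixes the outcome.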

Figures (11)

Figure 1: Definitions of (a) response calibration and our proposed (b) capability calibration. Given an input $x$, a model $f_{\theta}$, and its single sampled output $\hat{y}$, response calibration calibrates the confidence $s(x,\hat{y})$ against the correctness $\mathcal{C}$ of $\hat{y}$. By contrast, capability calibration calibrates the confidence $s(x,f_{\theta})$ against the expected accuracy $\mu$ of $f_{\theta}$'s output distribution.
Figure 2: Divergence of calibration targets. We plot the Response Calibration (RC) target $\mathcal{C}(x, \hat{y})$ and the Capability Calibration (CC) target $\mathbb{E}_{\hat{y} \sim f_\theta(\cdot \mid x)}[\mathcal{C}(x,\hat{y})]$. The data reveal a divergence between the two targets: instances where the RC label is 0 exhibit CC values spanning the full [0, 1] range, and the same holds for instances with an RC label of 1. This confirms that response-level outcomes do not reflect the model's true ability to answer a query.
Figure 3: Cost-performance tradeoff of different methods. We compare inference cost (x-axis, log-scale) against calibration performance (y-axis, 1 - Brier score), where the upper-left corner is the ideal region. Among evaluated methods, probing is the only one that consistently falls in this region (see Figure 4). For readability, we only plot the best-calibrated probe.
Figure 4: Cost-performance tradeoff of different confidence estimation methods with three LLMs on seven datasets. Following Figure 3, we compare inference cost (x-axis, average response tokens) against calibration performance (y-axis). Probing consistently outperforms the random baseline at the lowest cost, whereas response consistency costs more than decoding the responses themselves.
Figure 5: Inference budget allocation performance of capability-calibrated confidence. Given $N$ questions, we evaluate the performance (success rate) of different methods under a fixed inference budget of $N\times B$. The Oracle capability-calibrated confidence achieves the best performance. Meanwhile, the confidence estimators (verbalized and Probe-MATH) both outperform Uniform allocation across budgets.
...and 6 more figures
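The budget allocation setting in the Figure 5 caption can be illustrated with a short Python sketch. This is a hypothetical allocation rule written for illustration, not the procedure used in the paper: it assumes each query's confidence approximates its per-sample success probability, that samples are independent, and that the goal is to maximize the expected fraction of queries solved at least once. The function names and the confidence values are invented for the example.

```python
import heapq


def allocate_budget(confidences, total_budget):
    """Greedily give each sample to the query with the largest marginal gain.

    Under the i.i.d. assumption, the gain of sample number b+1 for a query
    with per-sample success probability mu is mu * (1 - mu) ** b.
    """
    if not confidences:
        return []
    alloc = [0] * len(confidences)
    # Max-heap via negated gains; the first sample's gain is mu itself.
    heap = [(-mu, i) for i, mu in enumerate(confidences)]
    heapq.heapify(heap)
    for _ in range(total_budget):
        _, i = heapq.heappop(heap)
        alloc[i] += 1
        mu = confidences[i]
        heapq.heappush(heap, (-(mu * (1.0 - mu) ** alloc[i]), i))
    return alloc


def expected_success_rate(confidences, alloc):
    """Expected fraction of queries solved at least once under the allocation."""
    solved = sum(1.0 - (1.0 - mu) ** b for mu, b in zip(confidences, alloc))
    return solved / len(confidences)


# Hypothetical confidences for N = 3 queries with B = 2 samples per query on average.
confidences = [0.9, 0.3, 0.05]
budget = len(confidences) * 2

greedy_alloc = allocate_budget(confidences, budget)
uniform_alloc = [2] * len(confidences)

greedy_rate = expected_success_rate(confidences, greedy_alloc)
uniform_rate = expected_success_rate(confidences, uniform_alloc)
```

With these toy values the greedy rule moves samples away from the near-hopeless query and toward the mid-confidence one, which raises the expected success rate relative to the uniform split. How well this works in practice depends on how close the confidences are to the true expected accuracy.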

Theorems & Definitions (4)

Theorem 1
Theorem 2
proof
proof
