How do LLMs Compute Verbal Confidence

Dharshan Kumaran; Arthur Conmy; Federico Barbero; Simon Osindero; Viorica Patraucean; Petar Velickovic

Abstract

Verbal confidence -- prompting LLMs to state their confidence as a number or category -- is widely used to extract uncertainty estimates from black-box models. However, how LLMs internally generate such scores remains unknown. We address two questions: first, when confidence is computed -- just-in-time when requested, or automatically during answer generation and cached for later retrieval; and second, what verbal confidence represents -- token log-probabilities, or a richer evaluation of answer quality. Focusing on Gemma 3 27B and Qwen 2.5 7B, we provide convergent evidence for cached retrieval. Activation steering, patching, noising, and swap experiments reveal that confidence representations emerge at answer-adjacent positions before appearing at the verbalization site. Attention blocking pinpoints the information flow: confidence is gathered from answer tokens, cached at the first post-answer position, then retrieved for output. Critically, linear probing and variance partitioning reveal that these cached representations explain substantial variance in verbal confidence beyond token log-probabilities, suggesting a richer answer-quality evaluation rather than a simple fluency readout. These findings demonstrate that verbal confidence reflects automatic, sophisticated self-evaluation -- not post-hoc reconstruction -- with implications for understanding metacognition in LLMs and improving calibration.
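The variance-partitioning argument in the abstract can be illustrated with a minimal sketch on synthetic data. This is not the authors' analysis code: the variables `logprob`, `quality`, `probe`, and `confidence`, and the coefficients used to generate them, are assumptions made purely for illustration. The idea is to compare the R² of a regression on token log-probabilities alone with the R² of a regression that also includes a probe readout; the difference is the variance in verbal confidence that the probe explains beyond log-probabilities.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of an ordinary least-squares fit of y on X (intercept added)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1.0 - resid.var() / y.var()

def unique_variance(probe, logprob, y):
    """Return (R^2 logprob-only, R^2 full model, unique R^2 of the probe)."""
    r2_logprob = r_squared(logprob.reshape(-1, 1), y)
    r2_full = r_squared(np.column_stack([logprob, probe]), y)
    return r2_logprob, r2_full, r2_full - r2_logprob

# Synthetic stand-ins (assumptions, not data from the paper).
rng = np.random.default_rng(0)
n = 2000
logprob = rng.normal(size=n)                 # hypothetical answer log-probability
quality = rng.normal(size=n)                 # hypothetical latent answer quality
confidence = 0.5 * logprob + 0.8 * quality + 0.3 * rng.normal(size=n)
probe = quality + 0.1 * rng.normal(size=n)   # hypothetical linear-probe readout

r2_lp, r2_full, r2_unique = unique_variance(probe, logprob, confidence)
print(f"logprob only: {r2_lp:.3f}  full: {r2_full:.3f}  unique to probe: {r2_unique:.3f}")
```

In this toy setup the probe's unique contribution is large by construction; on real activations the size of that gap is the empirical question the paper addresses.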

Paper Structure (42 sections, 12 equations, 19 figures)

Introduction
Experiments
Activation Steering
Activation Patching
Activation Noising
Activation Swap Experiment
Decoding Confidence Information
Attention Blocking Experiments
Generalization across Prompt Format and Architecture
Related Work
Conclusion
Related Work
Confidence and Calibration in LLMs.
Latent Representations of Uncertainty and Correctness.
Mechanistic Interpretability: Activation Steering
...and 27 more sections

Figures (19)

Figure 1: Main Prompt and Illustration of our findings. We included the generated answer (example question shown) from a previous phase as part of the prompt for the confidence rating experiment (see §\ref{app:answer_generation}). Since the Transformer's forward pass is a function of previous tokens, providing the answer as context yields the exact same representation at the PANL as autoregressive generation. See §\ref{app:figures} for the full prompt used. We provide convergent evidence that LLMs compute confidence via cached retrieval rather than just-in-time computation, and that verbal confidence does not merely reflect logprobs. (A) Confidence information is gathered at the post-answer-newline token ($\backslash$n, PANL) via attention to answer tokens, particularly the last answer token, at earlier layers (21--25). (B) This information is routed to the confidence-colon token (:, CC)---either directly or through intermediate tokens. (C) Confidence information persists in the residual stream at the confidence-colon through later layers (30--35). (D) Confidence is verbalized when CC's representation is transformed by the unembedding matrix at the final layer (layer 61). (E) Attention blocking experiments rule out just-in-time (JIT) computation: CC does not compute confidence from scratch by attending to question or answer tokens (red arrow from CC to Q and A tokens). (F) Decoding experiments reveal that verbal confidence is not explained by token log-probabilities, suggesting that it reflects a more sophisticated evaluation of question-answer fit.
Figure 2: Results of Activation Steering in Gemma 3 27B. High (green lines) and low confidence (red lines) steering, at scales of 2 (solid line) and 5 (dotted line). Key positions: PANL (post-answer-newline) token, CC (confidence-colon) token. Control positions: PANL+1 (token immediately after PANL), FCC (first-confidence-colon) token (i.e. token preceding "$CLASS" in the prompt, following the confidence instructions; see Figure \ref{fig:prompt_full}). Baseline confidence was 0.55 across all trials. Error bars show SEM (n=200 trials).
Figure 3: Results of Activation Patching in High Confidence Trials: Confidence Class Prompt. Clean baseline shown in green; corrupt baseline shown in red (i.e. at near zero for logit difference and confidence, and near 100 for first token change rate). Patching of the PANL representation resulted in partial recovery of logit difference, first token and confidence (upper, middle, lower panel respectively). Patching of the CC representation resulted in near complete recovery of confidence, logit difference and first token. PANL+1 patching resulted in effectively zero recovery.
Figure 4: Illustration of Activation Swap Experiment. Upper panel: High$\rightarrow$Low (i.e. cross-confidence swap: high confidence recipient trial receives low confidence donor representation) -- result is a lowering of confidence. Lower panel: High$\rightarrow$High (i.e. high confidence recipient trial receives high confidence donor representation) -- in this same-confidence swap, the result is no change in confidence.
Figure 5: Results of Activation Swap Experiment at PANL position. Same-confidence swaps (H$\rightarrow$H, L$\rightarrow$L) control for generic content-related effects due to introducing activations from a different trial; cross-confidence swaps (H$\rightarrow$L, L$\rightarrow$H) isolate confidence-specific transfer. See main text for details. See Figure \ref{fig:gemma_swap_suppl} for results at CC and PANL+1 positions.
...and 14 more figures
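
The steering and swap interventions described in the captions for Figures 2, 4 and 5 can be sketched on a toy residual stream. This is a hedged illustration only, not the paper's implementation: the activations are random NumPy arrays rather than Gemma or Qwen hidden states, and `direction`, `make_trial`, `readout`, `steer`, and `swap` are hypothetical names introduced here. The sketch assumes confidence is linearly encoded along a single unit vector at the PANL position, which is a simplification.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64
PANL = 1  # hypothetical index of the post-answer-newline position

# Hypothetical unit-norm "confidence direction" in the residual stream.
direction = rng.normal(size=d_model)
direction /= np.linalg.norm(direction)

def make_trial(conf_level, n_pos=4):
    """Toy activations (n_pos, d_model) with confidence written at PANL."""
    acts = rng.normal(scale=0.1, size=(n_pos, d_model))
    acts[PANL] += conf_level * direction
    return acts

def readout(acts, pos=PANL):
    """Project one position onto the confidence direction."""
    return float(acts[pos] @ direction)

def steer(acts, pos, scale):
    """Activation steering: add scale * direction at a single position."""
    out = acts.copy()
    out[pos] += scale * direction
    return out

def swap(recipient, donor, pos):
    """Activation swap: overwrite the recipient's activation at pos with the donor's."""
    out = recipient.copy()
    out[pos] = donor[pos]
    return out

high_a, high_b = make_trial(2.0), make_trial(2.0)
low = make_trial(-2.0)

cross = swap(high_a, low, PANL)    # High -> Low: cross-confidence swap
same = swap(high_a, high_b, PANL)  # High -> High: same-confidence control
steered = steer(high_a, PANL, 5.0)

print("baseline:", readout(high_a))
print("cross-confidence swap:", readout(cross))
print("same-confidence swap:", readout(same))
print("steered (+5):", readout(steered))
```

The same-confidence swap acts as the control: it introduces activations from a different trial without changing the confidence level, so any drop seen only in the cross-confidence swap can be attributed to the confidence content itself.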
