Closing the Confidence-Faithfulness Gap in Large Language Models

Miranda Muqing Miao, Lyle Ungar

Abstract

Large language models (LLMs) tend to verbalize confidence scores that are largely detached from their actual accuracy, yet the geometric relationships governing this behavior remain poorly understood. In this work, we present a mechanistic interpretability analysis of verbalized confidence, using linear probes and contrastive activation addition (CAA) steering to show that calibration and verbalized confidence signals are encoded linearly but are orthogonal to one another -- a finding consistent across three open-weight models and four datasets. Interestingly, when models are prompted to simultaneously reason through a problem and verbalize a confidence score, the reasoning process disrupts the verbalized confidence direction, exacerbating miscalibration. We term this the "Reasoning Contamination Effect." Leveraging this insight, we introduce a two-stage adaptive steering pipeline that reads the model's internal accuracy estimate and steers verbalized output to match it, substantially improving calibration alignment across all evaluated models.
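
For a concrete picture of the contrastive activation addition (CAA) steering mentioned above, the sketch below builds a steering vector from the mean activation difference between contrastive prompt pairs and adds it to one layer's residual stream during generation. The checkpoint name, layer index, steering strength, hook placement, and prompt pairs are illustrative assumptions, not the paper's actual configuration.

```python
# Minimal CAA steering sketch. All names and hyperparameters below are
# illustrative placeholders, not the paper's configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-7B"   # assumed checkpoint
layer_idx, alpha = 21, 4.0       # assumed layer and steering strength

tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

def last_token_activation(prompt: str) -> torch.Tensor:
    """Residual-stream activation of the final prompt token at layer_idx."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer_idx][0, -1]

# Contrastive prompt pairs: same question, opposite verbalized confidence.
high_conf = ["Q: What is 2 + 2? My confidence that I am correct is 95%."]
low_conf  = ["Q: What is 2 + 2? My confidence that I am correct is 5%."]

steer = torch.stack([last_token_activation(p) for p in high_conf]).mean(0) \
      - torch.stack([last_token_activation(p) for p in low_conf]).mean(0)

def add_steering(module, inputs, output):
    """Forward hook: add the scaled steering vector to every token position."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * steer.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.model.layers[layer_idx].register_forward_hook(add_steering)
prompt = "Q: What is the capital of Australia? Rate your confidence (0-100%):"
out_ids = model.generate(**tok(prompt, return_tensors="pt"), max_new_tokens=20)
print(tok.decode(out_ids[0], skip_special_tokens=True))
handle.remove()
```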

Paper Structure

This paper contains 32 sections, 3 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Ridge probe projection at layer 21 (Qwen-2.5-7B-Base). Left: Distribution of activations projected onto the probe weight vector, separated by correct (blue) and incorrect (pink) answers (Cohen's $d = 1.88$). Right: The same scalar projection plotted against binned empirical accuracy ($r = 0.80$). Takeaway: The model encodes well-calibrated accuracy information in a single linear direction, even when never asked about confidence (see the probe-fitting sketch after this figure list).
  • Figure 2: High and low verbalized confidence occupy distinct regions of activation space (25th vs. 75th percentile split). First principal component of activations from the pure confidence prompt, colored by whether the model verbalized high or low confidence. Takeaway: Verbalized confidence is linearly separable in later layers, confirming that the model constructs a dedicated confidence representation during processing.
  • Figure 3: Probe fit and directional analysis across layers (Qwen-2.5-7B-Base). (a, b) Train and test $R^2$ of ridge probes predicting empirical accuracy (gold calibration, blue) and verbalized confidence (pure verbal, orange). (c) Cosine similarity between the two probe weight vectors (pure verbal vs. gold calibration). (d) Cosine similarity between contrastive confidence and accuracy directions, computed separately under the pure confidence prompt (blue) and the joint solve-and-rate prompt (red). Shaded region indicates the gap between the two conditions, i.e., the reasoning contamination effect. Takeaway: Accuracy and confidence are encoded in nearly orthogonal directions (cosine similarity $< 0.04$), and joint prompting inverts their relationship (from $+0.26$ to $-0.63$).
  • Figure 4: Prompt templates for three elicitation conditions. (a) The pure correctness prompt asks the model only to solve the problem, with no mention of confidence. (b) The pure confidence prompt asks the model only to rate its confidence, without producing a solution. (c) The joint prompt asks the model to first rate its confidence and then solve the problem. Separating these conditions allows us to isolate the model's confidence representation from the computational process of problem-solving.
  • Figure 5: Subspace orthogonality analysis between gold calibration and verbalized confidence representations across transformer layers. (a) Mean principal angle between 10-dimensional predictive subspaces extracted via iterative ridge regression with deflation; the gray band shows the $\pm 2\sigma$ range for random subspace pairs of equal dimensionality. (b) Top two canonical correlations from CCA applied to the 5-dimensional projections of each concept's subspace. (c) $R^2$ retention ratio after projecting out the other concept's top-10 subspace (cross-concept removal) versus projecting out one's own subspace (self-removal control). (d) Variance decomposition showing unique and shared $R^2$ for each concept, where shared $R^2$ is measured by predicting one target using only the other concept's subspace directions. Across all four analyses and all layers, the two representations occupy nearly orthogonal subspaces with negligible shared structure (see the subspace-angle sketch below).
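
As a rough illustration of the linear-probe analysis summarized in Figures 1 and 3, the sketch below fits ridge probes on per-example activations, computes Cohen's $d$ of the accuracy-probe projection, and measures the cosine similarity between the two probe directions. The synthetic data, array shapes, and ridge penalty are placeholder assumptions; the paper's actual activation extraction and evaluation splits may differ.

```python
# Sketch of the probe analysis behind Figures 1 and 3. Assumed data layout:
# X is an (n_examples, d_model) matrix of layer activations, y_correct holds
# 0/1 answer correctness, and y_conf holds verbalized confidence in [0, 1].
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 2000, 512                       # placeholder sizes, not from the paper
X = rng.normal(size=(n, d))
y_correct = rng.integers(0, 2, size=n).astype(float)
y_conf = rng.uniform(0, 1, size=n)

def fit_probe(X, y):
    """Fit a ridge probe and report its held-out R^2."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    probe = Ridge(alpha=10.0).fit(X_tr, y_tr)
    return probe, probe.score(X_te, y_te)

probe_acc, r2_acc = fit_probe(X, y_correct)    # "gold calibration" probe
probe_conf, r2_conf = fit_probe(X, y_conf)     # "pure verbal" confidence probe

# Cohen's d of the scalar projection, split by correctness (Figure 1, left).
proj = X @ probe_acc.coef_
a, b = proj[y_correct == 1], proj[y_correct == 0]
pooled = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
cohens_d = (a.mean() - b.mean()) / pooled

# Cosine similarity between the two probe directions (Figure 3c).
w1, w2 = probe_acc.coef_, probe_conf.coef_
cos_sim = w1 @ w2 / (np.linalg.norm(w1) * np.linalg.norm(w2))
print(f"test R^2 acc={r2_acc:.2f}, conf={r2_conf:.2f}, "
      f"d={cohens_d:.2f}, cos={cos_sim:.3f}")
```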
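The subspace-orthogonality test in Figure 5(a) can be approximated with principal angles between two $k$-dimensional subspaces, as sketched below. The bases `U` and `V` stand in for the 10-dimensional predictive subspaces that the paper extracts via iterative ridge regression with deflation; here they are random placeholders, so the sketch only demonstrates the angle computation and the random-baseline band.

```python
# Principal angles between two k-dimensional subspaces (Figure 5a, approximate).
# U and V are placeholder orthonormal bases; the paper's subspaces come from
# iterative ridge regression with deflation, which is not reproduced here.
import numpy as np
from scipy.linalg import subspace_angles

rng = np.random.default_rng(0)
d, k = 512, 10

U, _ = np.linalg.qr(rng.normal(size=(d, k)))   # stand-in "accuracy" subspace
V, _ = np.linalg.qr(rng.normal(size=(d, k)))   # stand-in "confidence" subspace

angles = np.degrees(subspace_angles(U, V))
print(f"mean principal angle: {angles.mean():.1f} deg")  # near 90 deg = near-orthogonal

# Random-baseline band (+/- 2 sigma) for subspace pairs of equal dimensionality.
baseline = [
    np.degrees(subspace_angles(np.linalg.qr(rng.normal(size=(d, k)))[0],
                               np.linalg.qr(rng.normal(size=(d, k)))[0])).mean()
    for _ in range(100)
]
lo = np.mean(baseline) - 2 * np.std(baseline)
hi = np.mean(baseline) + 2 * np.std(baseline)
print(f"random-pair band: [{lo:.1f}, {hi:.1f}] deg")
```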