The Confidence Manifold: Geometric Structure of Correctness Representations in Language Models

Seonglae Cho, Zekun Wu, Kleyton Da Costa, Adriano Koshiyama

TL;DR

This work reveals that the model-internal signal for correctness is encoded in a low-dimensional subspace of the residual activations, typically $3$--$8$ dimensions, and behaves like a mean-shift between correct and incorrect distributions. A simple centroid-based detector in this subspace achieves near parity with trained probes ($\text{AUC} \approx 0.90$) and enables few-shot detection; importantly, nonlinear classifiers do not outperform a linear hyperplane. Activation steering demonstrates causal influence: perturbing along the learned direction yields a $10.9$ percentage point change in error rates, while controls do not. Internal representations outperform output-based uncertainty measures, exposing a fundamental gap between what the model knows and what it reveals, with cross-domain transfer improved when projecting onto the discriminative subspace. These findings suggest a universal, architecture-agnostic structure for correctness signals and offer practical, low-cost detection and control tools for factual integrity in language models.
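
The link between a mean shift and the sufficiency of a linear hyperplane can be made explicit with a short derivation. It rests on an assumption that is not stated in the summary above: that activations $\mathbf{h}$ of correct ($c=1$) and incorrect ($c=0$) answers are Gaussian with different means $\boldsymbol{\mu}_1, \boldsymbol{\mu}_0$ and one shared covariance $\Sigma$. These symbols are introduced here for illustration only.

```latex
\begin{align}
\log\frac{p(\mathbf{h}\mid c=1)}{p(\mathbf{h}\mid c=0)}
&= -\tfrac{1}{2}(\mathbf{h}-\boldsymbol{\mu}_1)^\top \Sigma^{-1} (\mathbf{h}-\boldsymbol{\mu}_1)
   + \tfrac{1}{2}(\mathbf{h}-\boldsymbol{\mu}_0)^\top \Sigma^{-1} (\mathbf{h}-\boldsymbol{\mu}_0) \\
&= (\boldsymbol{\mu}_1-\boldsymbol{\mu}_0)^\top \Sigma^{-1}\mathbf{h}
   - \tfrac{1}{2}\left(\boldsymbol{\mu}_1^\top \Sigma^{-1}\boldsymbol{\mu}_1
   - \boldsymbol{\mu}_0^\top \Sigma^{-1}\boldsymbol{\mu}_0\right)
\end{align}
```

The quadratic terms in $\mathbf{h}$ cancel, so under this assumption the log-likelihood ratio is linear in $\mathbf{h}$ and no nonlinear classifier can improve on it. In the special case $\Sigma = \sigma^2 I$ the expression reduces to $\frac{1}{2\sigma^2}\left(\lVert\mathbf{h}-\boldsymbol{\mu}_0\rVert^2 - \lVert\mathbf{h}-\boldsymbol{\mu}_1\rVert^2\right)$, which is a comparison of distances to the two centroids.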

Abstract

When a language model asserts that "the capital of Australia is Sydney," does it know this is wrong? We characterize the geometry of correctness representations across 9 models from 5 architecture families. The structure is simple: the discriminative signal occupies 3-8 dimensions, performance degrades with additional dimensions, and no nonlinear classifier improves over linear separation. Centroid distance in the low-dimensional subspace matches trained probe performance (0.90 AUC), enabling few-shot detection: on GPT-2, 25 labeled examples achieve 89% of full-data accuracy. We validate causally through activation steering: the learned direction produces 10.9 percentage point changes in error rates while random directions show no effect. Internal probes achieve 0.80-0.97 AUC; output-based methods (P(True), semantic entropy) achieve only 0.44-0.64 AUC. The correctness signal exists internally but is not expressed in outputs. That centroid distance matches probe performance indicates class separation is a mean shift, making detection geometric rather than learned.
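
As a rough illustration of centroid-distance detection from few labeled examples, the sketch below runs on synthetic data rather than real model activations. Everything in it is an assumption made for illustration: the toy data generator, the function names, and in particular the `basis` matrix, which here is the true subspace known by construction. How the paper estimates the discriminative subspace is not described in this summary.

```python
import numpy as np

def make_synthetic(n, d, k, shift, rng):
    """Toy activations: unit Gaussian noise in d dims; 'correct' examples
    (label 1) are shifted by `shift` along the first k coordinates only."""
    labels = np.arange(n) % 2                  # alternating labels, roughly balanced
    acts = rng.normal(size=(n, d))
    acts[:, :k] += shift * labels[:, None]     # mean shift confined to k dims
    return acts, labels

def fit_centroids(acts, labels, basis):
    """Project onto the (d, k) basis and return (correct, incorrect) centroids."""
    z = acts @ basis
    return z[labels == 1].mean(axis=0), z[labels == 0].mean(axis=0)

def centroid_score(acts, basis, mu_correct, mu_incorrect):
    """Squared distance to the incorrect centroid minus squared distance to the
    correct one. Higher means closer to 'correct'. No trained classifier is used."""
    z = acts @ basis
    d_incorrect = ((z - mu_incorrect) ** 2).sum(axis=1)
    d_correct = ((z - mu_correct) ** 2).sum(axis=1)
    return d_incorrect - d_correct

def auc(scores, labels):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return float(greater + 0.5 * ties)

rng = np.random.default_rng(0)
d, k = 64, 4                                   # hypothetical sizes, not from the paper
basis = np.eye(d)[:, :k]                       # assumed-known discriminative subspace
train_x, train_y = make_synthetic(25, d, k, 1.0, rng)    # 25 labeled examples
test_x, test_y = make_synthetic(2000, d, k, 1.0, rng)

mu_c, mu_i = fit_centroids(train_x, train_y, basis)
few_shot_auc = auc(centroid_score(test_x, basis, mu_c, mu_i), test_y)
print(f"few-shot centroid AUC on toy data: {few_shot_auc:.3f}")
```

Projecting before averaging is the design choice that matters for the few-shot setting: with only 25 examples, centroid estimates in the full space pick up noise from every irrelevant dimension, while estimates in the small subspace do not.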

Paper Structure (43 sections, 2 equations, 8 figures, 11 tables)

Figures (8)

  • Figure 1: Layer-wise evolution across 9 models. (a) Detection performance peaks at different depths: GPT-2 family at final layers (100% depth), instruction-tuned models at mid-layers (43--75% depth). (b) Intrinsic dimension decreases through layers, converging to 8--12D at optimal layers.
  • Figure 2: Steering intervention analysis. Error rate on held-out TruthfulQA questions vs. steering coefficient $\alpha \in [-5, 5]$. Interventions modify the forward pass at the optimal layer: $\mathbf{h}' = \mathbf{h} + \alpha \cdot \hat{\mathbf{w}}$. The learned confidence direction (green) produces a monotonic 10.9 percentage point swing: $\alpha = -5$ increases error rate to 0.63 (steering toward uncertainty), $\alpha = +5$ decreases it to 0.52 (steering toward confidence). Random directions (gray, $\mathbf{r} \sim \mathcal{N}(0, I)$) and orthogonal directions (orange, $\mathbf{r}_\perp$) show no systematic effect, remaining at baseline 0.56.
  • Figure 3: 3D PLS visualization of the confidence manifold. Row 1: instruction-tuned models (Qwen2-7B, Mistral-7B, Llama-3B). Row 2: GPT-2 family (base models). Convex hulls show class regions; stars mark centroids. GPT-2 family shows clearer visual separation despite lower AUC (0.80--0.84), while instruction-tuned models achieve higher AUC (0.91--0.97) with more overlap in 3D projection. See Appendix \ref{app:instruct_small_manifold} for smaller instruction-tuned models.
  • Figure 4: Universal geometric patterns across architectures. (a) Normalized intrinsic dimension (MLE) by layer depth. All models compress from early to late layers (mean curve in black), with peak dimension at 10--20% depth. (b) Dimension-performance correlation: lower intrinsic dimension correlates with higher probe AUC ($r = -0.43$, $p < 0.001$), but $R^2 = 0.18$ indicates dimension explains less than one-fifth of variance; classification utility depends on direction, not dimensionality. (c) Cross-layer probe weight similarity averaged across models shows three-phase block-diagonal structure with phase boundaries at 30% and 70% depth.
  • Figure 5: Intrinsic dimension evolution by architecture. (a) Raw MLE estimates show all models compress from 20--55D to 8--12D, except Mistral-7B which exhibits late-layer expansion (80--100D at 90%+ depth). (b) Normalized dimension enables cross-model comparison: models follow a common compression trajectory until 80% depth, after which Mistral diverges due to unembedding preparation.
  • ...and 3 more figures
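
The steering update in the Figure 2 caption, $\mathbf{h}' = \mathbf{h} + \alpha \cdot \hat{\mathbf{w}}$, together with its random and orthogonal controls, can be sketched on plain arrays as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the control directions are normalized to unit length (the caption only says $\mathbf{r} \sim \mathcal{N}(0, I)$), and hooking the update into an actual model's forward pass is not shown.

```python
import numpy as np

def unit(v):
    """Return v scaled to unit length."""
    return v / np.linalg.norm(v)

def steer(h, direction, alpha):
    """Apply h' = h + alpha * w_hat, where w_hat is the unit-normalized direction.
    h may be a single (d,) vector or a (batch, d) array."""
    return h + alpha * unit(direction)

def random_direction(d, rng):
    """Control: a random Gaussian direction (unit length is an assumption here)."""
    return unit(rng.normal(size=d))

def orthogonal_direction(w, rng):
    """Control: a random direction with its component along w removed."""
    w_hat = unit(w)
    r = rng.normal(size=w.shape)
    r = r - (r @ w_hat) * w_hat      # Gram-Schmidt step against w_hat
    return unit(r)

rng = np.random.default_rng(0)
d = 16                               # hypothetical hidden size
w = rng.normal(size=d)               # stand-in for a learned confidence direction
h = rng.normal(size=(3, d))          # stand-in for a batch of residual activations
alpha = 5.0                          # within the caption's range [-5, 5]

w_hat = unit(w)
r_perp = orthogonal_direction(w, rng)

# Shift of each activation's projection onto w_hat under the two interventions.
shift_learned = (steer(h, w, alpha) - h) @ w_hat
shift_orthogonal = (steer(h, r_perp, alpha) - h) @ w_hat
```

By construction, steering along the learned direction moves every activation's projection onto $\hat{\mathbf{w}}$ by exactly $\alpha$, while the orthogonal control leaves that projection unchanged. This is what makes the orthogonal direction a useful control: it perturbs the activation by the same magnitude without touching the component under study.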