The Confidence Manifold: Geometric Structure of Correctness Representations in Language Models

Seonglae Cho; Zekun Wu; Kleyton Da Costa; Adriano Koshiyama

The Confidence Manifold: Geometric Structure of Correctness Representations in Language Models

Seonglae Cho, Zekun Wu, Kleyton Da Costa, Adriano Koshiyama

TL;DR

This work reveals that the model-internal signal for correctness is encoded in a low-dimensional subspace of the residual activations, typically $3$--$8$ dimensions, and behaves like a mean-shift between correct and incorrect distributions. A simple centroid-based detector in this subspace achieves near parity with trained probes ($ ext{AUC} oughly 0.90$) and enables few-shot detection; importantly, nonlinear classifiers do not outperform a linear hyperplane. Activation steering demonstrates causal influence: perturbing along the learned direction yields a $10.9$ percentage point change in error rates, while controls do not. Internal representations outperform output-based uncertainty measures, exposing a fundamental gap between what the model knows and what it reveals, with cross-domain transfer improved when projecting onto the discriminative subspace. These findings suggest a universal, architecture-agnostic structure for correctness signals and offer practical, low-cost detection and control tools for factual integrity in language models.

Abstract

When a language model asserts that "the capital of Australia is Sydney," does it know this is wrong? We characterize the geometry of correctness representations across 9 models from 5 architecture families. The structure is simple: the discriminative signal occupies 3-8 dimensions, performance degrades with additional dimensions, and no nonlinear classifier improves over linear separation. Centroid distance in the low-dimensional subspace matches trained probe performance (0.90 AUC), enabling few-shot detection: on GPT-2, 25 labeled examples achieve 89% of full-data accuracy. We validate causally through activation steering: the learned direction produces 10.9 percentage point changes in error rates while random directions show no effect. Internal probes achieve 0.80-0.97 AUC; output-based methods (P(True), semantic entropy) achieve only 0.44-0.64 AUC. The correctness signal exists internally but is not expressed in outputs. That centroid distance matches probe performance indicates class separation is a mean shift, making detection geometric rather than learned.

The Confidence Manifold: Geometric Structure of Correctness Representations in Language Models

TL;DR

This work reveals that the model-internal signal for correctness is encoded in a low-dimensional subspace of the residual activations, typically

dimensions, and behaves like a mean-shift between correct and incorrect distributions. A simple centroid-based detector in this subspace achieves near parity with trained probes (

) and enables few-shot detection; importantly, nonlinear classifiers do not outperform a linear hyperplane. Activation steering demonstrates causal influence: perturbing along the learned direction yields a

percentage point change in error rates, while controls do not. Internal representations outperform output-based uncertainty measures, exposing a fundamental gap between what the model knows and what it reveals, with cross-domain transfer improved when projecting onto the discriminative subspace. These findings suggest a universal, architecture-agnostic structure for correctness signals and offer practical, low-cost detection and control tools for factual integrity in language models.

Abstract

Paper Structure (43 sections, 2 equations, 8 figures, 11 tables)

This paper contains 43 sections, 2 equations, 8 figures, 11 tables.

Introduction
Background
Method
Data Construction and Direction Learning
Geometric Analysis
Causal Validation via Activation Steering
Why Direct Labels, Not Entropy?
Experiments
Datasets
Models
Baselines
Probing Protocol
Results
The Confidence Signal is 3--8 Dimensional
Linear Separability in the Discriminative Subspace
...and 28 more sections

Figures (8)

Figure 1: Layer-wise evolution across 9 models. (a) Detection performance peaks at different depths: GPT-2 family at final layers (100%), instruction-tuned models at mid-layers (43--75%). (b) Intrinsic dimension decreases through layers, converging to 8--12D at optimal layers.
Figure 2: Steering intervention analysis. Error rate on held-out TruthfulQA questions vs. steering coefficient $\alpha \in [-5, 5]$. Interventions modify the forward pass at the optimal layer: $\mathbf{h}' = \mathbf{h} + \alpha \cdot \hat{\mathbf{w}}$. The learned confidence direction (green) produces a monotonic 10.9 percentage point swing: $\alpha = -5$ increases error rate to 0.63 (steering toward uncertainty), $\alpha = +5$ decreases it to 0.52 (steering toward confidence). Random directions (gray, $\mathbf{r} \sim \mathcal{N}(0, I)$) and orthogonal directions (orange, $\mathbf{r}_\perp$) show no systematic effect, remaining at baseline 0.56.
Figure 3: 3D PLS visualization of the confidence manifold. Row 1: instruction-tuned models (Qwen2-7B, Mistral-7B, Llama-3B). Row 2: GPT-2 family (base models). Convex hulls show class regions; stars mark centroids. GPT-2 family shows clearer visual separation despite lower AUC (0.80--0.84), while instruction-tuned models achieve higher AUC (0.91--0.97) with more overlap in 3D projection. See Appendix \ref{['app:instruct_small_manifold']} for smaller instruction-tuned models.
Figure 4: Universal geometric patterns across architectures. (a) Normalized intrinsic dimension (MLE) by layer depth. All models compress from early to late layers (mean curve in black), with peak dimension at 10--20% depth. (b) Dimension-performance correlation: lower intrinsic dimension correlates with higher probe AUC ($r = -0.43$, $p < 0.001$), but $R^2 = 0.18$ indicates dimension explains less than one-fifth of variance; classification utility depends on direction, not dimensionality. (c) Cross-layer probe weight similarity averaged across models shows three-phase block-diagonal structure with phase boundaries at 30% and 70% depth.
Figure 5: Intrinsic dimension evolution by architecture. (a) Raw MLE estimates show all models compress from 20--55D to 8--12D, except Mistral-7B which exhibits late-layer expansion (80--100D at 90%+ depth). (b) Normalized dimension enables cross-model comparison: models follow a common compression trajectory until 80% depth, after which Mistral diverges due to unembedding preparation.
...and 3 more figures

The Confidence Manifold: Geometric Structure of Correctness Representations in Language Models

TL;DR

Abstract

The Confidence Manifold: Geometric Structure of Correctness Representations in Language Models

Authors

TL;DR

Abstract

Table of Contents

Figures (8)