The Confidence Manifold: Geometric Structure of Correctness Representations in Language Models
Seonglae Cho, Zekun Wu, Kleyton Da Costa, Adriano Koshiyama
TL;DR
This work shows that a language model's internal signal for correctness is encoded in a low-dimensional subspace of the residual activations, typically $3$--$8$ dimensions, and behaves like a mean shift between the distributions of correct and incorrect answers. A simple centroid-based detector in this subspace achieves near parity with trained probes ($\text{AUC} \approx 0.90$) and enables few-shot detection; notably, nonlinear classifiers do not outperform a linear hyperplane. Activation steering demonstrates causal influence: perturbing along the learned direction changes error rates by $10.9$ percentage points, while control directions have no effect. Internal representations outperform output-based uncertainty measures, exposing a gap between what the model knows and what it reveals, and cross-domain transfer improves when activations are projected onto the discriminative subspace. These findings suggest a universal, architecture-agnostic structure for correctness signals and offer practical, low-cost tools for detecting and controlling factual errors in language models.
Abstract
When a language model asserts that "the capital of Australia is Sydney," does it know this is wrong? We characterize the geometry of correctness representations across 9 models from 5 architecture families. The structure is simple: the discriminative signal occupies 3-8 dimensions, performance degrades with additional dimensions, and no nonlinear classifier improves over linear separation. Centroid distance in the low-dimensional subspace matches trained probe performance (0.90 AUC), enabling few-shot detection: on GPT-2, 25 labeled examples achieve 89% of full-data accuracy. We validate causally through activation steering: the learned direction produces 10.9 percentage point changes in error rates while random directions show no effect. Internal probes achieve 0.80-0.97 AUC; output-based methods (P(True), semantic entropy) achieve only 0.44-0.64 AUC. The correctness signal exists internally but is not expressed in outputs. That centroid distance matches probe performance indicates class separation is a mean shift, making detection geometric rather than learned.
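The centroid-distance idea can be illustrated with a minimal sketch on synthetic data. This is not the authors' implementation: the choice of PCA to obtain the low-dimensional subspace, the synthetic mean-shift data, and all names (`fit_centroid_detector`, `correctness_score`, `auc`) are assumptions made for illustration. It only shows that when two classes differ by a mean shift, scoring by distance to class centroids in a few dimensions needs no trained classifier.

```python
import numpy as np

def fit_centroid_detector(X, y, k=5):
    """X: (n, d) activations; y: (n,) with 1 = correct, 0 = incorrect."""
    mean = X.mean(axis=0)
    # Assumption: use the top-k principal directions as the subspace.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    basis = vt[:k].T                      # (d, k) projection matrix
    Z = (X - mean) @ basis                # activations in the subspace
    mu_correct = Z[y == 1].mean(axis=0)   # centroid of correct examples
    mu_incorrect = Z[y == 0].mean(axis=0) # centroid of incorrect examples
    return mean, basis, mu_correct, mu_incorrect

def correctness_score(X, detector):
    """Higher score = closer to the correct centroid than the incorrect one."""
    mean, basis, mu_correct, mu_incorrect = detector
    Z = (X - mean) @ basis
    dist_incorrect = np.linalg.norm(Z - mu_incorrect, axis=1)
    dist_correct = np.linalg.norm(Z - mu_correct, axis=1)
    return dist_incorrect - dist_correct

def auc(scores, y):
    """Rank-based AUC (Mann-Whitney U statistic); assumes no tied scores."""
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_pos = int(y.sum())
    n_neg = len(y) - n_pos
    return (ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Synthetic stand-in for residual activations: isotropic noise plus a
# mean shift along one hidden direction for the "correct" class.
rng = np.random.default_rng(0)
d, n = 64, 400
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)

def sample(num):
    y = rng.integers(0, 2, size=num)
    X = rng.normal(size=(num, d)) + 3.0 * y[:, None] * direction
    return X, y

X_train, y_train = sample(n)
X_test, y_test = sample(n)

detector = fit_centroid_detector(X_train, y_train, k=5)
test_auc = auc(correctness_score(X_test, detector), y_test)
print(f"held-out AUC: {test_auc:.3f}")
```

In practice `X` would hold residual-stream activations extracted from a chosen layer and `y` the correctness labels of the model's answers; how the subspace is selected and which layer is used are details this sketch leaves open.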
