The Truthfulness Spectrum Hypothesis

Zhuofan Josh Ying, Shauli Ravfogel, Nikolaus Kriegeskorte, Peter Hase

Abstract

Large language models (LLMs) have been reported to linearly encode truthfulness, yet recent work questions this finding's generality. We reconcile these views with the truthfulness spectrum hypothesis: the representational space contains directions ranging from broadly domain-general to narrowly domain-specific. To test this hypothesis, we systematically evaluate probe generalization across five truth types (definitional, empirical, logical, fictional, and ethical), sycophantic and expectation-inverted lying, and existing honesty benchmarks. Linear probes generalize well across most domains but fail on sycophantic and expectation-inverted lying. Yet training on all domains jointly recovers strong performance, confirming that domain-general directions exist despite poor pairwise transfer. The geometry of probe directions explains these patterns: Mahalanobis cosine similarity between probes near-perfectly predicts cross-domain generalization (R^2=0.98). Concept-erasure methods further isolate truth directions that are (1) domain-general, (2) domain-specific, or (3) shared only across particular domain subsets. Causal interventions reveal that domain-specific directions steer more effectively than domain-general ones. Finally, post-training reshapes truth geometry, pushing sycophantic lying further from other truth types, suggesting a representational basis for chat models' sycophantic tendencies. Together, our results support the truthfulness spectrum hypothesis: truth directions of varying generality coexist in representational space, with post-training reshaping their geometry. Code for all experiments is provided in https://github.com/zfying/truth_spec.
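To make the geometric claim concrete, the following is a minimal sketch of a cosine similarity computed under a non-Euclidean inner product. This section does not give the paper's exact definition of Mahalanobis cosine similarity, so everything here is an assumption: the function name `metric_cosine`, the choice of the activation covariance as the metric matrix `M` (the inverse covariance is another plausible choice), and the synthetic activations are all hypothetical.

```python
import numpy as np

def metric_cosine(u, v, M):
    """Cosine similarity under the inner product <u, v>_M = u^T M v.

    M is assumed to be symmetric positive definite (for example, an
    activation covariance matrix). With M = I this reduces to the
    standard cosine similarity.
    """
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    numerator = u @ M @ v
    denominator = np.sqrt((u @ M @ u) * (v @ M @ v))
    return float(numerator / denominator)

# Hypothetical activations whose dimensions have very different variances.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4)) * np.array([10.0, 1.0, 1.0, 0.1])
Sigma = np.cov(X, rowvar=False)  # assumed metric matrix, shape (4, 4)

# Two hypothetical probe directions that share only the high-variance axis.
w_a = np.array([1.0, 1.0, 0.0, 0.0])
w_b = np.array([1.0, 0.0, 1.0, 0.0])

standard = metric_cosine(w_a, w_b, np.eye(4))
weighted = metric_cosine(w_a, w_b, Sigma)
print(f"standard cosine: {standard:.3f}, covariance-weighted cosine: {weighted:.3f}")
```

Under this assumed definition, two probes that agree along high-variance activation dimensions count as more similar than the standard cosine suggests, which is one way a metric-aware similarity could track cross-domain transfer more closely.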

Paper Structure (57 sections, 9 equations, 26 figures, 2 tables)

Figures (26)

  • Figure 1: Truth representations in LLMs are graded in generality and reshaped by post-training. Left: Different truth types share partially overlapping but distinct sets of truth directions. These directions lie on a spectrum from domain-general to domain-specific. The geometry of truth representations changes through post-training, pushing sycophancy into a more distant subspace from other truth types. This reorganization causes probes trained on factual truth to fail on sycophancy detection, and vice versa (✗). However, training on all domains still yields a domain-general direction (✓). Right: Concept erasure analysis further reveals the full spectrum of truth directions.
  • Figure 2: Probing Generalization Performance. We report the average AUROC for 5-fold cross-validation on Llama-3.3-70B. Probes trained on any one of our five truth types generalize to each other, but perform poorly on sycophantic and expectation-inverted lying. A probe trained on all domains generalizes well to all domains, performing on par with the best individual probe performance.
  • Figure 3: Mahalanobis cosine similarity linearly predicts OOD probe performance. Each point is a pair of datasets: the probe is trained on one and tested on the other. Mahalanobis cosine similarity achieves $R^2 = 0.98$, substantially outperforming standard cosine similarity ($R^2 = 0.56$; Figure \ref{fig:geo-auroc-scossim}).
  • Figure 4: Post-training reduces alignment between sycophantic lying and other truth types. (a, b) Base models show substantially better probe generalization between FLEED and sycophancy than chat models, indicating that post-training pushes sycophancy into a subspace more orthogonal to other truth types. (c) Probe direction similarity between FLEED and sycophancy is significantly higher in base models than in chat models. See similar results on Qwen family models in Appendix \ref{sec:app-add-post-training} and Figure \ref{fig:post-training-qwen}.
  • Figure 5: Stratified INLP Reveals Highly Domain-general and Domain-specific Directions. (a) Domain-general directions. Cross-domain accuracies for the first five mutually-orthogonal directions extracted by training on all domains jointly are high across all domains. (b) Domain-specific directions. Accuracy for directions extracted from individual domains after the four domain-general directions have been projected out. While in-distribution accuracy ("Self"; blue) remains high, generalization to other domains ("Other"; yellow) drops toward chance ($0.5$; gray dashed line), indicating these directions encode truth information unique to a specific domain.
  • ...and 21 more figures