From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs
Stanley Yu, Vaidehi Bulusu, Oscar Yasunaga, Clayton Lau, Cole Blondin, Sean O'Brien, Kevin Zhu, Vasu Sharma
TL;DR
This work extends the concept cone framework to propositional truth in LLMs, revealing a multi-dimensional subspace that causally mediates truthful responses across model families. By combining activation addition, directional ablation, and a loss-guided cone discovery procedure, the authors identify orthogonal cone axes that together steer truth-related behavior with minimal collateral drift, validated across multiple models and datasets. The findings show that truth is not confined to a single linear direction but can be captured by a cone spanned by several basis vectors, which enables robust, targeted interventions while preserving general instruction-following behavior. These results advance mechanistic interpretability by offering a scalable, semi-local tool for probing abstract behaviors. They also raise concerns about vulnerability to manipulation and underline the need for interpretable labels for the cone axes. The work points toward extensions to larger instruction-tuned models and multimodal settings, and toward principled methods for mapping cone axes to semantically meaningful facets of truth.
Abstract
Large Language Models (LLMs) exhibit strong conversational abilities but often generate falsehoods. Prior work suggests that the truthfulness of simple propositions can be represented as a single linear direction in a model's internal activations, but this may not fully capture its underlying geometry. In this work, we extend the concept cone framework, recently introduced for modeling refusal, to the domain of truth. We identify multi-dimensional cones that causally mediate truth-related behavior across multiple LLM families. Our results are supported by three lines of evidence: (i) causal interventions reliably flip model responses to factual statements, (ii) learned cones generalize across model architectures, and (iii) cone-based interventions preserve unrelated model behavior. These findings reveal the richer, multidirectional structure governing simple true/false propositions in LLMs and highlight concept cones as a promising tool for probing abstract behaviors.
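The two causal interventions named in the TL;DR, directional ablation and activation addition, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`orthonormal_basis`, `ablate`, `steer`), the random stand-in activations, and the treatment of a cone as the set of non-negative combinations of orthonormal basis vectors are all assumptions made for illustration. In practice the activations would come from a model's residual stream and the basis from a learned cone.

```python
import numpy as np

def orthonormal_basis(directions):
    """Orthonormalize k row vectors of shape (k, d) via reduced QR; returns (k, d)."""
    q, _ = np.linalg.qr(np.asarray(directions, dtype=float).T)  # q has shape (d, k)
    return q.T

def ablate(activations, basis):
    """Directional ablation: remove each activation's component in span(basis)."""
    coeffs = activations @ basis.T        # (n, k) projections onto each axis
    return activations - coeffs @ basis   # (n, d) with the subspace projected out

def steer(activations, basis, weights, alpha=1.0):
    """Activation addition along a unit vector inside the cone.

    Assumption: a cone direction is a non-negative combination of the basis vectors.
    """
    weights = np.asarray(weights, dtype=float)
    if np.any(weights < 0) or not np.any(weights > 0):
        raise ValueError("weights must be non-negative and not all zero")
    v = weights @ basis
    v = v / np.linalg.norm(v)
    return activations + alpha * v        # broadcasts the shift over all rows

# Hypothetical stand-ins for model activations and learned cone directions.
rng = np.random.default_rng(0)
acts = rng.normal(size=(5, 16))
basis = orthonormal_basis(rng.normal(size=(3, 16)))

ablated = ablate(acts, basis)
steered = steer(acts, basis, weights=[0.5, 0.3, 0.2], alpha=2.0)
print(np.abs(ablated @ basis.T).max())    # close to zero after ablation
```

Ablation projects out the whole subspace rather than a single vector, which matches the claim that truth is not confined to one direction. Steering shifts every activation by `alpha` along one chosen direction inside the cone.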
