From Directions to Cones: Exploring Multidimensional Representations of Propositional Facts in LLMs

Stanley Yu, Vaidehi Bulusu, Oscar Yasunaga, Clayton Lau, Cole Blondin, Sean O'Brien, Kevin Zhu, Vasu Sharma

TL;DR

This work extends the concept cone framework to propositional truth in LLMs, revealing a rich, multi-dimensional subspace that causally mediates truthful responses across model families. By combining activation addition, directional ablation, and a loss-guided cone discovery procedure, the authors identify orthogonal cone axes that collectively steer truth with minimal collateral drift, validated across multiple models and datasets. The findings show that truth is not confined to a single linear direction but can be captured by a cone with several basis vectors, enabling robust, targeted interventions while preserving general instruction-following behavior. These results advance mechanistic interpretability by providing a scalable, semi-local tool for probing abstract behaviors and raise considerations about potential vulnerabilities to manipulation and the need for interpretable labeling of the cone axes. The work sets a clear path for extending to larger, instruction-tuned models and multimodal settings, as well as for developing principled methods to map cone axes to semantically meaningful truth facets.

Abstract

Large Language Models (LLMs) exhibit strong conversational abilities but often generate falsehoods. Prior work suggests that the truthfulness of simple propositions can be represented as a single linear direction in a model's internal activations, but this may not fully capture its underlying geometry. In this work, we extend the concept cone framework, recently introduced for modeling refusal, to the domain of truth. We identify multi-dimensional cones that causally mediate truth-related behavior across multiple LLM families. Our results are supported by three lines of evidence: (i) causal interventions reliably flip model responses to factual statements, (ii) learned cones generalize across model architectures, and (iii) cone-based interventions preserve unrelated model behavior. These findings reveal the richer, multidirectional structure governing simple true/false propositions in LLMs and highlight concept cones as a promising tool for probing abstract behaviors.
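The two interventions named above, directional ablation and activation addition, can be illustrated with a small NumPy sketch. This is a toy under stated assumptions, not the authors' implementation: the activations are random stand-ins for residual-stream vectors, the cone basis is a random orthonormal set rather than a learned one, and all function names (`ablate_direction`, `add_direction`, `sample_cone_direction`) are hypothetical. The sketch assumes the common formulation in which ablation removes the component of an activation along a unit direction, addition shifts the activation along that direction, and a cone is the set of non-negative combinations of its basis vectors. The loss-guided cone discovery procedure is not shown.

```python
import numpy as np

def unit(v):
    """Return v scaled to unit L2 norm."""
    return v / np.linalg.norm(v)

def ablate_direction(x, r):
    """Directional ablation: remove the component of each row of x along r.

    x has shape (batch, d_model); r has shape (d_model,).
    """
    r_hat = unit(r)
    return x - np.outer(x @ r_hat, r_hat)

def add_direction(x, r, alpha=1.0):
    """Activation addition: shift every row of x by alpha * r."""
    return x + alpha * r

def sample_cone_direction(basis, rng):
    """Sample a unit direction inside the cone spanned by the rows of basis.

    Assumption: the cone is the set of non-negative combinations of its
    basis vectors, so non-negative random weights are drawn here.
    """
    weights = rng.random(basis.shape[0])  # entries in [0, 1)
    return unit(weights @ basis)

rng = np.random.default_rng(0)
d_model = 16

# Toy stand-in for a learned 3-dimensional cone: orthonormal rows from a QR
# decomposition of a random matrix (a real basis would be learned).
q, _ = np.linalg.qr(rng.normal(size=(d_model, 3)))
basis = q.T                              # shape (3, d_model)

x = rng.normal(size=(4, d_model))        # toy activations, not model outputs
r = sample_cone_direction(basis, rng)    # one direction drawn from the cone

x_abl = ablate_direction(x, r)           # component along r removed
x_add = add_direction(x, r, alpha=2.0)   # shifted by 2 units along r

print(np.abs(x_abl @ r).max())           # close to zero after ablation
```

Drawing many directions with `sample_cone_direction` and applying the same intervention to each is one way to probe whether every direction in a candidate cone has the intended effect, rather than only its basis vectors.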

Paper Structure

This paper contains 53 sections, 9 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Theoretical visualization of a 2D concept cone. All directions in the cone should causally mediate truthful behavior. Given a true propositional input (e.g., “Paris is the capital of France”), ablating along any basis vector of this cone disrupts the model’s ability to generate a truthful response.
  • Figure 2: The Attack Success Rate (ASR) of one-dimensional cones across layers for Qwen and Gemma models. Layer numbers are normalized so that larger and smaller models can be compared. Effectiveness spikes rapidly in all models in the 0.60-0.75 range of normalized layer numbers.
  • Figure 3: The Answer Switching Rate (ASR) of cones from dimensions 1 to 5 across Qwen2.5 and Gemma2 models with boxplots showing the Monte Carlo sampling.
  • Figure 4: Projections of Gemma-2-9B representations of datasets onto their top two PCs, across all layers.
  • Figure 5: Projections of Qwen2.5-7B representations of datasets onto their top two PCs, across all layers.

Theorems & Definitions (4)

  • Definition 2.1
  • Definition 3.1
  • Definition 3.2
  • Definition 3.3