Truth as a Trajectory: What Internal Representations Reveal About Large Language Model Reasoning

Hamed Damirchi, Ignacio Meza De la Jara, Ehsan Abbasnejad, Afshar Shamsi, Zhen Zhang, Javen Shi

TL;DR

This work introduces Truth as a Trajectory (TaT), which models transformer inference as an unfolded trajectory of iterative refinements. It shifts analysis from static activations to layer-wise geometric displacement and establishes trajectory analysis as a complementary perspective on LLM explainability.

Abstract

Existing explainability methods for Large Language Models (LLMs) typically treat hidden states as static points in activation space, assuming that correct and incorrect inferences can be separated using representations from an individual layer. However, these activations are saturated with polysemantic features, so linear probes tend to learn surface-level lexical patterns rather than underlying reasoning structures. We introduce Truth as a Trajectory (TaT), which models transformer inference as an unfolded trajectory of iterative refinements, shifting analysis from static activations to layer-wise geometric displacement. By analyzing the displacement of representations across layers, TaT uncovers geometric invariants that distinguish valid reasoning from spurious behavior. We evaluate TaT across dense and Mixture-of-Experts (MoE) architectures on benchmarks spanning commonsense reasoning, question answering, and toxicity detection. Without access to the activations themselves, using only the changes in activations across layers, we show that TaT effectively mitigates reliance on static lexical confounds and outperforms conventional probing, establishing trajectory analysis as a complementary perspective on LLM explainability.
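To make the idea of layer-wise geometric displacement concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the per-layer hidden states of a single token position have already been stacked into an array of shape `(num_layers + 1, d)`; the function name `layer_displacements` and the random stand-in data are hypothetical.

```python
import numpy as np


def layer_displacements(hidden_states: np.ndarray) -> np.ndarray:
    """Return the change in representation between consecutive layers.

    hidden_states: array of shape (num_layers + 1, d), one row per layer
                   output for a single token position (assumed layout).
    Returns an array of shape (num_layers, d) where row l equals
    hidden_states[l + 1] - hidden_states[l].
    """
    return np.diff(hidden_states, axis=0)


# Stand-in data: 5 layer outputs of dimension 8 (random values, not model activations).
rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(5, 8))

deltas = layer_displacements(hidden_states)

# The displacements telescope: the first state plus all displacements
# recovers the final state, so the sequence describes the full path.
reconstructed_last = hidden_states[0] + deltas.sum(axis=0)
```

A downstream classifier in this sketch would see only `deltas`, not `hidden_states`, which mirrors the abstract's statement that the method uses changes in activations rather than the activations themselves.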

Paper Structure

This paper contains 42 sections, 7 equations, 3 figures, and 11 tables.

Figures (3)

  • Figure 1: Trajectories reveal structure beyond static embeddings. We plot layerwise hidden states as trajectories in activation space. Correct generations (green) follow smoother paths, while incorrect ones (red) exhibit sharp deviations. Although the supervised projection amplifies separation, the geometry suggests that modeling entire trajectories, rather than isolated states, can help distinguish valid from spurious reasoning.
  • Figure 2: Performance on 4 reasoning benchmarks using kinematic descriptors. The red dashed line marks the accuracy of a random classifier. Although activation velocity obtains better results than the base model, the improvement is not consistent across datasets.
  • Figure 3: Top performance of Qwen2.5-14b on 4 reasoning benchmarks using kinematic descriptors with varying rule-sets. The red dashed line marks the accuracy of a random classifier. Although activation velocity obtains better results than the base model itself, the improvement is not consistent across datasets, despite the oracle-guided approach to evaluation.
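The captions above refer to kinematic descriptors such as activation velocity without defining them in this excerpt. As an illustrative sketch only, one plausible reading treats the layer index as time and takes finite differences along the trajectory. The definitions below (velocity as a first difference, acceleration as a second difference, and a turning cosine between consecutive steps) and the name `kinematic_descriptors` are assumptions, not the paper's stated formulas.

```python
import numpy as np


def kinematic_descriptors(trajectory: np.ndarray) -> dict:
    """Finite-difference descriptors of a layer-wise trajectory.

    trajectory: array of shape (num_states, d), one row per layer output.
    All definitions here are assumed for illustration.
    """
    velocity = np.diff(trajectory, axis=0)        # step between consecutive layers
    acceleration = np.diff(velocity, axis=0)      # change in step between layers
    speed = np.linalg.norm(velocity, axis=1)      # length of each step
    eps = 1e-12                                   # guards against division by zero
    # Cosine of the angle between consecutive steps: 1 means no change in direction.
    turning_cosine = np.sum(velocity[:-1] * velocity[1:], axis=1) / (
        speed[:-1] * speed[1:] + eps
    )
    return {
        "velocity": velocity,
        "acceleration": acceleration,
        "speed": speed,
        "turning_cosine": turning_cosine,
    }


# A straight-line trajectory: every step is the same vector [1, 2].
straight = np.outer(np.arange(4), np.array([1.0, 2.0]))
desc = kinematic_descriptors(straight)
```

On the straight-line input, the acceleration is zero and the turning cosine is one at every step, which gives a simple sanity check for a smooth path; a trajectory with sharp deviations would show large accelerations and lower turning cosines under these assumed definitions.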