How Transformers Reject Wrong Answers: Rotational Dynamics of Factual Constraint Processing

Javier Marín

Abstract

When a language model is fed a wrong answer, what happens inside the network? Current understanding treats truthfulness as a static property of individual-layer representations: a direction to be probed, a feature to be extracted. Less is known about the dynamics: how internal representations diverge across the full depth of the network when the model processes correct versus incorrect continuations. We introduce forced-completion probing, a method that presents identical queries with known correct and incorrect single-token continuations and tracks five geometric measurements across every layer of four decoder-only models (1.5B to 13B parameters). We report three findings. First, correct and incorrect paths diverge through rotation, not rescaling: displacement vectors maintain near-identical magnitudes while their angular separation increases, meaning factual selection is encoded in direction on an approximate hypersphere. Second, the model does not passively fail on incorrect input; it actively suppresses the correct answer, driving internal probability away from the right token. Third, both phenomena are entirely absent below a parameter threshold and emerge at 1.6B, suggesting a phase transition in factual processing capability. These results show that factual constraint processing has a specific geometric character (rotational, not scalar; active, not passive) that is invisible to methods based on single-layer probes or magnitude comparisons.
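The "rotation, not rescaling" finding can be illustrated with a minimal sketch. It assumes that a displacement vector is the layer-to-layer difference of hidden states along one forward pass; this is one plausible reading of the abstract, not the paper's formal Definition 1 or 2, which are not reproduced here. The helper names (`displacements`, `rotation_vs_rescaling`, `toy_path`) and the 2-D toy trajectories are hypothetical stand-ins for real per-layer hidden states.

```python
import math


def norm(v):
    """Euclidean length of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    return sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))


def displacements(states):
    """Layer-to-layer differences h[l+1] - h[l] (assumed meaning of 'displacement')."""
    return [[b - a for a, b in zip(h0, h1)] for h0, h1 in zip(states, states[1:])]


def rotation_vs_rescaling(correct_states, incorrect_states):
    """For each layer step, return (magnitude ratio, angle in degrees)
    between the incorrect-path and correct-path displacement vectors."""
    out = []
    pairs = zip(displacements(correct_states), displacements(incorrect_states))
    for d_cor, d_inc in pairs:
        ratio = norm(d_inc) / norm(d_cor)
        cos = max(-1.0, min(1.0, cosine(d_cor, d_inc)))  # clamp rounding error
        out.append((ratio, math.degrees(math.acos(cos))))
    return out


def toy_path(step_angles_deg):
    """Hypothetical 2-D trajectory: unit-length steps at the given headings."""
    state = [0.0, 0.0]
    path = [state]
    for angle in step_angles_deg:
        r = math.radians(angle)
        state = [state[0] + math.cos(r), state[1] + math.sin(r)]
        path.append(state)
    return path


# Toy data: both paths take unit-length steps, but the incorrect path turns.
correct = toy_path([0, 0, 0])
incorrect = toy_path([0, 30, 60])

for step, (ratio, angle) in enumerate(rotation_vs_rescaling(correct, incorrect)):
    print(f"step {step}: |d_inc|/|d_cor| = {ratio:.3f}, angle = {angle:.1f} deg")
```

In this toy case the magnitude ratio stays at about 1 while the angle grows with depth, which is the signature the abstract describes. With a real model, the two state lists would come from the per-layer hidden states of the correct and incorrect forced completions at the answer-token position.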

Paper Structure

This paper contains 25 sections, 4 equations, 3 figures, and 4 tables.

Figures (3)

  • Figure 1: Trajectory similarity $\tau(\ell)$ between correct and incorrect answer hidden states across normalized depth. Deep constraint (red) and control (blue) queries diverge maximally at $\ell/L \approx 0.3\text{--}0.4$; neutral queries (gray) show less divergence. The pattern is absent in Qwen2 1.5B (not shown; $\tau > 0.99$ at all layers). Shaded regions: $\pm 1$ SEM.
  • Figure 2: Commitment ratio $\kappa(\ell)$ across normalized depth. Solid lines: correct answers by category type. Dashed gray: incorrect answers (all categories pooled). In LLaMA-2 and Mistral, $\kappa$ for incorrect answers collapses below 0.10, indicating active suppression. StableLM-2 shows a weaker effect; Qwen2 1.5B (not shown) remains at 0.50 throughout.
  • Figure 3: Left: Probe accuracy (5-fold CV) across normalized depth. All three models peak at intermediate layers, with accuracy declining toward the final layer. Circles mark peaks. Right: Cross-domain transfer AUROC. Within-domain (blue) is near-perfect; cross-domain (red) degrades, particularly for StableLM-2. Qwen2 1.5B (not shown) stays at chance level (0.50) throughout.

Theorems & Definitions (5)

  • Definition 1: Displacement field
  • Definition 2: Rotational divergence
  • Definition 3: Commitment ratio
  • Definition 4: Active suppression
  • Definition 5: Attention allocation ratio