How Transformers Reject Wrong Answers: Rotational Dynamics of Factual Constraint Processing

Javier Marín

Abstract

When a language model is fed a wrong answer, what happens inside the network? Current understanding treats truthfulness as a static property of individual-layer representations: a direction to be probed, a feature to be extracted. Less is known about the dynamics, that is, how internal representations diverge across the full depth of the network when the model processes correct versus incorrect continuations. We introduce forced-completion probing, a method that presents identical queries with known correct and incorrect single-token continuations and tracks five geometric measurements across every layer of four decoder-only models (1.5B-13B parameters). We report three findings. First, correct and incorrect paths diverge through rotation, not rescaling: displacement vectors maintain near-identical magnitudes while their angular separation increases, meaning factual selection is encoded in direction on an approximate hypersphere. Second, the model does not passively fail on incorrect input; it actively suppresses the correct answer, driving internal probability away from the right token. Third, both phenomena are entirely absent below a parameter threshold and emerge at 1.6B parameters, suggesting a phase transition in factual processing capability. These results show that factual constraint processing has a specific geometric character (rotational, not scalar; active, not passive) that is invisible to methods based on single-layer probes or magnitude comparisons.
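The distinction between rotation and rescaling in the first finding can be illustrated with a minimal sketch. It assumes, as a simplification that is not taken from the paper, that a displacement vector is the difference between hidden states at consecutive layers; the paper's own Definitions 1 and 2 are not reproduced on this page and may differ. The function names and the toy two-dimensional states are hypothetical.

```python
import math

def displacement(h_prev, h_next):
    # Assumed definition: layer-to-layer change in the hidden state.
    return [b - a for a, b in zip(h_prev, h_next)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def angle_deg(u, v):
    # Angle between two vectors, clamped to guard against rounding error.
    cos = sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))

def compare_paths(correct_states, incorrect_states):
    # For each layer transition, return (magnitude ratio, angle in degrees)
    # between the correct-path and incorrect-path displacements.
    rows = []
    for layer in range(len(correct_states) - 1):
        d_c = displacement(correct_states[layer], correct_states[layer + 1])
        d_i = displacement(incorrect_states[layer], incorrect_states[layer + 1])
        rows.append((norm(d_c) / norm(d_i), angle_deg(d_c, d_i)))
    return rows

# Toy hidden states (hypothetical, 2-D): both paths take unit-length steps,
# but the incorrect path turns by a right angle at the second transition.
correct = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
incorrect = [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
rows = compare_paths(correct, incorrect)
for ratio, angle in rows:
    print(f"magnitude ratio = {ratio:.2f}, angle = {angle:.1f} deg")
```

In this toy case the magnitude ratio stays at 1 while the angle grows, which is the signature the abstract describes as rotation without rescaling; a pure rescaling would show the opposite pattern, with a changing ratio and an angle near zero.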

Paper Structure (25 sections, 4 equations, 3 figures, 4 tables)

Introduction
Related Work
Representation geometry
Probing truthfulness in hidden states
Factual recall mechanisms
Logit lens and iterative inference
Geometric Measurements
Experimental Setup
Experimental Dataset
Models
Hidden State Extraction
Token-span matching
Implementation
Statistical Analysis
Results
...and 10 more sections

Figures (3)

Figure 1: Trajectory similarity $\tau(\ell)$ between correct and incorrect answer hidden states across normalized depth. Deep constraint (red) and control (blue) queries diverge maximally at $\ell/L \approx 0.3\text{--}0.4$; neutral queries (gray) show less divergence. The pattern is absent in Qwen2 1.5B (not shown; $\tau > 0.99$ at all layers). Shaded regions: $\pm 1$ SEM.
Figure 2: Commitment ratio $\kappa(\ell)$ across normalized depth. Solid lines: correct answers by category type. Dashed gray: incorrect answers (all categories pooled). In LLaMA-2 and Mistral, $\kappa$ for incorrect answers collapses below 0.10, indicating active suppression. StableLM-2 shows a weaker effect; Qwen2 1.5B (not shown) remains at 0.50 throughout.
Figure 3: Left: Probe accuracy (5-fold CV) across normalized depth. All three models peak at intermediate layers, with accuracy declining toward the final layer. Circles mark peaks. Right: Cross-domain transfer AUROC. Within-domain (blue) is near-perfect; cross-domain (red) degrades, particularly for StableLM-2. Qwen2 1.5B (not shown) achieves 0.50 throughout.

Theorems & Definitions (5)

Definition 1: Displacement field
Definition 2: Rotational divergence
Definition 3: Commitment ratio
Definition 4: Active suppression
Definition 5: Attention allocation ratio
