Directional Routing in Transformers

Kevin Taylor

Abstract

We introduce directional routing, a lightweight mechanism that gives each transformer attention head learned suppression directions controlled by a shared router, at 3.9% parameter cost. We train a 433M-parameter model alongside an identical baseline in a single run, then trace the resulting circuits through mechanistic interpretability. Routing becomes the model's dominant computational pathway. Disabling it collapses factual recall to near-zero probability across all 8 test prompts and drops induction accuracy from 93.4% to 0.0%. Knocking out individual attention heads has negligible effect: the primary mover head's removal actually increases target probability, and induction heads retain 98.6% accuracy without their strongest member. The coordination mechanism is irreplaceable; the components it coordinates are not. The model also self-organizes, without explicit pressure, into two regimes: domain-adaptive routing in early layers and fixed syntactic pruning in late layers, where the least-varying layer is the most critical (+42.6 PPL when disabled). Routing reduces perplexity 31-56% relative to the baseline, though downstream multiple-choice benchmarks do not yet reflect these gains.

Paper Structure (28 sections, 2 equations, 5 figures, 7 tables)

Figures (5)

  • Figure 1: Standard transformer block (left) vs. routed transformer block (right). The routing mechanism (dashed box) adds three components: a shared router MLP that produces per-input weights $r_{h,k} \in [0,1]$ from the mean-pooled sequence, learned direction vectors $\mathbf{d}_{h,k}$, and a directional suppression step that removes selected components from each head's output. All other components are identical.
  • Figure 2: Per-layer routing weight distributions. Layer 0 shows the widest spread ($\sigma{=}0.272$), consistent with domain-adaptive specialist routing. Layer 7 is bimodal ($\sigma{=}0.290$, 14.8% near-zero). Layer 9 never produces weights above 0.9 despite being the most critical layer. All layer means exceed 0.5 (range 0.508--0.674); the model prefers suppression over preservation on average.
  • Figure 3: Routing benefit ($\Delta$ log-prob) by token position. The benefit decays from $+1.35$ at positions 0--9 to $+0.19$ at positions 100 and beyond, a direct consequence of the mean-pooling bottleneck.
  • Figure 4: Direction geometry. Mean within-head angle: 75.9° (vs. 89.4° random in 128D). Effective rank: 112.9/128 (88% capacity usage). Mild superposition, no dimensional collapse.
  • Figure 5: Preliminary cross-scale results at 26M and 400M parameters. Parameter overhead is 3.9% at both scales. Two data points; not a scaling law.
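
The routing step described in the Figure 1 caption can be illustrated with a minimal NumPy sketch. It is one plausible reading of the caption, not the paper's implementation: the function names (`router_weights`, `suppress`), the two-layer ReLU router, the sigmoid that maps logits to $[0,1]$, the unit-normalization of directions, and the update $\mathbf{o}_h \leftarrow \mathbf{o}_h - \sum_k r_{h,k}\,(\mathbf{o}_h \cdot \hat{\mathbf{d}}_{h,k})\,\hat{\mathbf{d}}_{h,k}$ are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def router_weights(x, w1, b1, w2, b2, n_heads, n_dirs):
    """Map a sequence x of shape (T, d_model) to weights r of shape (n_heads, n_dirs).

    Assumed design: mean-pool over positions, one ReLU hidden layer,
    sigmoid output so every weight lies in [0, 1].
    """
    pooled = x.mean(axis=0)                     # (d_model,) one vector per input
    hidden = np.maximum(0.0, pooled @ w1 + b1)  # ReLU hidden layer (assumption)
    logits = hidden @ w2 + b2                   # (n_heads * n_dirs,)
    return sigmoid(logits).reshape(n_heads, n_dirs)

def suppress(head_out, directions, r):
    """Remove a weighted share of each learned direction from each head's output.

    head_out:   (T, H, D) per-position, per-head outputs
    directions: (H, K, D) learned direction vectors d_{h,k}
    r:          (H, K)    routing weights; 0 keeps the component, 1 removes it
    """
    d_hat = directions / np.linalg.norm(directions, axis=-1, keepdims=True)
    # Scalar projection of every head output onto every direction of that head.
    coeff = np.einsum("thd,hkd->thk", head_out, d_hat)
    # Weighted sum of the projected components to subtract.
    removed = np.einsum("thk,hk,hkd->thd", coeff, r, d_hat)
    return head_out - removed

# Toy usage with hypothetical sizes (not the paper's configuration).
rng = np.random.default_rng(0)
T, d_model, H, K, D, hid = 6, 16, 2, 3, 8, 32
x = rng.normal(size=(T, d_model))
w1, b1 = rng.normal(size=(d_model, hid)) * 0.1, np.zeros(hid)
w2, b2 = rng.normal(size=(hid, H * K)) * 0.1, np.zeros(H * K)

r = router_weights(x, w1, b1, w2, b2, H, K)
routed = suppress(rng.normal(size=(T, H, D)), rng.normal(size=(H, K, D)), r)
print(r.shape, routed.shape)
```

Because the weights come from a mean over all positions, every token in a sequence shares the same $r_{h,k}$; this is the bottleneck the Figure 3 caption refers to. Note also that when a head's directions are not mutually orthogonal, subtracting the projections one by one (as in this sketch) is not an exact projection onto their span, even with all weights equal to 1.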