How Alignment Routes: Localizing, Scaling, and Controlling Policy Circuits in Language Models

Gregory N. Frank

Abstract

This paper identifies a recurring sparse routing mechanism in alignment-trained language models: a gate attention head reads detected content and triggers downstream amplifier heads that boost the signal toward refusal. Using political censorship and safety refusal as natural experiments, the mechanism is traced across 9 models from 6 labs, each validated on corpora of 120 prompt pairs. The gate head passes necessity and sufficiency interchange tests (p < 0.001, permutation null), and the core amplifier heads are stable under bootstrap resampling (Jaccard 0.92-1.0). Three same-generation scaling pairs show that routing becomes more distributed at scale (ablation effects up to 17x weaker) while remaining detectable by interchange. Modulating the detection-layer signal continuously controls policy strength, from hard refusal through steering to factual compliance, with routing thresholds that vary by topic. The circuit also reveals a structural separation between intent recognition and policy routing: under cipher encoding, the gate head's interchange necessity collapses by 70-99% across three models (n=120), and the model responds with puzzle-solving rather than refusal. The routing mechanism never fires, even though probe scores at deeper layers indicate the model begins to represent the harmful content. This asymmetry is consistent with the different robustness properties of pretraining and post-training: broad semantic understanding versus narrower policy binding that generalizes less well under input transformation.
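
As a minimal illustration of the interchange statistics quoted above, the sketch below computes a sign-flip permutation p-value for a paired difference in refusal scores between clean and gate-patched runs. Everything here is hypothetical scaffolding: `permutation_pvalue` and the synthetic `clean`/`patched` arrays stand in for the paper's actual per-prompt measurements and are not part of its pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

def permutation_pvalue(clean, patched, n_perm=10_000, rng=rng):
    """Two-sided sign-flip permutation test for a paired mean difference.

    clean, patched: per-prompt refusal scores (e.g., refusal-vs-comply
    logit differences) without and with the gate head's activation
    interchanged. Under the null, each paired difference is symmetric
    around zero, so its sign is exchangeable.
    """
    diffs = np.asarray(patched) - np.asarray(clean)
    observed = diffs.mean()
    signs = rng.choice([-1.0, 1.0], size=(n_perm, diffs.size))
    null = (signs * diffs).mean(axis=1)
    # The +1 correction keeps the estimate valid at small counts.
    return (1 + (np.abs(null) >= np.abs(observed)).sum()) / (1 + n_perm)

# Synthetic stand-in for a 120-pair corpus: interchange raises refusal.
clean = rng.normal(0.0, 1.0, size=120)
patched = clean + rng.normal(1.5, 1.0, size=120)
print(f"p = {permutation_pvalue(clean, patched):.4f}")
```

The same routine applies to both the necessity direction (patching the gate head away from a harmful prompt) and the sufficiency direction (patching it into a benign one); only the definition of `clean` and `patched` changes.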

Paper Structure

This paper contains 62 sections, 10 figures, and 4 tables.

Figures (10)

  • Figure 1: Routing mechanism overview. Detection forms at layers 15-16. A gate head writes a routing vector; amplifier heads boost it toward refusal. MLP pathways carry topic-specific signal in parallel. Modulating the detection-layer input moves output between refusal and factual answering.
  • Figure 2: Routing is prompt-time and contextual (Qwen3-8B). Left: Per-layer DLA profiles at the last prompt token and the first generated token overlap. Right: Same keyword, different framing, different layer-16 probe scores; annotated edge cases confirm routing is not a simple threshold.
  • Figure 3: The invisible shift across the Qwen family. Left: Refusal drops from 33% to 0% while steering rises. Right: Top-1 routing head DLA amplitude peaks in Qwen3-8B and falls sharply in Qwen3.5; total routing signal drops.
  • Figure 4: Three-step discovery pipeline (Qwen3-8B, n=24 discovery corpus). Left: Per-head DLA heatmap; deep layers dominate. Center: Head-level ablation; layers 22-23 dominate, L22.H7 leads, L17.H17 is sixth. Right: Necessity × sufficiency; L17.H17 has the strongest combined score by a wide margin (the stability of such head rankings is sketched after this list).
  • Figure 5: Gate knockout cascade in three architectures (n=120). Paired bars show each amplifier head before (blue) and after (red) gate ablation. Qwen3-8B: 5/6 amplifiers suppressed 5-26%. Phi-4-mini: 3/5 amplifiers suppressed 6-16%. Gemma-2-2B: 3/5 amplifiers suppressed 2-10%.
  • ...and 5 more figures
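
The bootstrap stability check referenced in the abstract (Jaccard 0.92-1.0 for the core amplifier set) can be sketched in a few lines. The sketch below is a generic illustration under assumed inputs: `effects` is a hypothetical (prompts × heads) matrix of per-prompt ablation effect sizes, not the paper's data, and the synthetic example simply plants six strong heads to show how the statistic behaves.

```python
import numpy as np

rng = np.random.default_rng(0)

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def topk_heads(effects, k=6):
    """Indices of the k heads with the largest mean ablation effect."""
    return np.argsort(effects.mean(axis=0))[-k:]

def bootstrap_jaccard(effects, k=6, n_boot=1000, rng=rng):
    """Stability of the top-k head set under prompt-level resampling.

    effects: (n_prompts, n_heads) per-prompt ablation effect sizes
    (hypothetical; stands in for the paper's per-head measurements).
    """
    full = topk_heads(effects, k)
    n = effects.shape[0]
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample prompts with replacement
        scores.append(jaccard(full, topk_heads(effects[idx], k)))
    return np.mean(scores), np.quantile(scores, [0.025, 0.975])

# Synthetic data: 120 prompts x 32 heads, with 6 genuinely strong heads.
effects = rng.normal(0.0, 1.0, size=(120, 32))
effects[:, :6] += 2.0
mean_j, (lo, hi) = bootstrap_jaccard(effects)
print(f"mean Jaccard {mean_j:.2f}  (95% CI {lo:.2f}-{hi:.2f})")
```

A mean Jaccard near 1.0 indicates the top-k set barely changes when the prompt corpus is resampled, which is the sense in which the core amplifier heads are reported as stable.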