Routing Absorption in Sparse Attention: Why Random Gates Are Hard to Beat

Keston Aquino-Michaels

TL;DR

The paper connects routing absorption in sparse attention to the same phenomenon in Mixture-of-Experts, where random routing matches learned routing because experts co-adapt to any router, and shows that attention exhibits a structurally more severe form: shared Q/K/V parameters enable cross-layer compensation pathways absent in MoE.

Abstract

Can a transformer learn which attention entries matter during training? In principle, yes: attention distributions are highly concentrated, and a small gate network can identify the important entries post-hoc with near-perfect accuracy. In practice, barely. When sparse attention is trained end-to-end, the model's Q/K/V projections co-adapt to whatever mask is imposed, absorbing the routing signal until learned gates perform little better than frozen random gates. We call this routing absorption and present four independent lines of evidence for it in a controlled 31M-parameter transformer: (1) differentiable soft gating converges to nearly the same perplexity whether the gate is learned or random (48.73 +/- 0.60 vs. 49.83 +/- 0.04 over 3 seeds); (2) hard top-k gating receives exactly zero gradient through the mask; (3) a gate distilled onto co-adapted Q/K/V achieves high F1 against oracle masks but catastrophic perplexity when deployed (601.6 vs. 48.6 on mask-agnostic Q/K/V); and (4) stochastic mask randomization during training fails to prevent co-adaptation (78.2 ppl deployed dense vs. 37.3 baseline). We connect routing absorption to the same phenomenon in Mixture-of-Experts, where random routing matches learned routing because experts co-adapt to any router, but show that attention exhibits a structurally more severe form: shared Q/K/V parameters enable cross-layer compensation pathways absent in MoE, where experts are self-contained modules. The implication is that end-to-end sparse attention methods employing per-query token-level gating face absorption pressure proportional to the parameter asymmetry between the gate and the model, and that post-hoc approaches, which decouple representation learning from sparsification, sidestep this entirely.
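
Evidence (2) in the abstract, that hard top-k gating receives zero gradient through the mask, can be checked numerically on a toy example. The sketch below is illustrative only and is not the paper's implementation: the function names, the single-query toy inputs, and the soft-gating form (adding log-sigmoid of the gate score to the attention logits) are all assumptions made for this example. It uses central finite differences, so the zero-gradient result holds only away from ties in the gate scores, where a small perturbation cannot change the top-k set.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def hard_topk_attention(logits, gate_scores, values, k):
    """Attend only to the k positions with the highest gate scores."""
    order = sorted(range(len(gate_scores)),
                   key=lambda i: gate_scores[i], reverse=True)
    keep = set(order[:k])
    # The gate influences the output only through this discrete selection.
    masked = [l if i in keep else float("-inf") for i, l in enumerate(logits)]
    weights = softmax(masked)
    return sum(w * v for w, v in zip(weights, values))

def soft_gated_attention(logits, gate_scores, values):
    """Hypothetical soft gate: bias each logit by log(sigmoid(gate))."""
    biased = [l + math.log(1.0 / (1.0 + math.exp(-g)))
              for l, g in zip(logits, gate_scores)]
    weights = softmax(biased)
    return sum(w * v for w, v in zip(weights, values))

def finite_diff(f, gate_scores, i, eps=1e-4):
    """Central finite-difference estimate of d f / d gate_scores[i]."""
    up = list(gate_scores)
    up[i] += eps
    down = list(gate_scores)
    down[i] -= eps
    return (f(up) - f(down)) / (2 * eps)

# Toy single-query example (all numbers are made up for illustration).
logits = [2.0, 0.5, -1.0, 1.0]
gates = [0.3, -0.2, 0.9, 0.1]
values = [1.0, 2.0, 3.0, 4.0]

hard_grads = [
    finite_diff(lambda g: hard_topk_attention(logits, g, values, k=2), gates, i)
    for i in range(len(gates))
]
soft_grads = [
    finite_diff(lambda g: soft_gated_attention(logits, g, values), gates, i)
    for i in range(len(gates))
]

print("hard top-k gradient w.r.t. gate scores:", hard_grads)
print("soft gate gradient w.r.t. gate scores: ", soft_grads)
```

In this toy setting the hard top-k output is piecewise constant in the gate scores, so every finite-difference estimate is zero, while the soft gate produces a nonzero signal. The abstract's claim (1) is that even this nonzero signal ends up absorbed once Q/K/V are free to co-adapt.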

Paper Structure

This paper contains 29 sections, 4 figures, and 9 tables.

Figures (4)

  • Figure 1: Attention concentration by layer in the 31M model. Each bar shows the fraction of total attention mass captured by the 64 highest-weight positions out of 512. Layers 2 and 3 are nearly one-hot, concentrating 100% of mass in the top 64; even the least concentrated layer captures 66%. The dashed line shows the average across all layers (90.6%).
  • Figure 2: Convergence dynamics under decoupled vs. co-adapted training. (a) Post-hoc gate-only training on the frozen 31M model: the learned gate converges from 46.8 to 37.3 ppl in 500 steps ($>$99% of total improvement), while the frozen random gate stays flat. (b) Single-layer co-adaptation at Qwen3 scale: learned and random gates converge to identical perplexity (8.80), with the no-gate control reaching 8.70; the gate adds overhead rather than signal. The contrast between panels is the core of the absorption argument: the same gate architecture solves routing in hundreds of steps when decoupled, but provides zero benefit when Q/K/V can co-adapt.
  • Figure 3: The absorption gradient at Qwen3-1.7B scale. As more layers unfreeze (increasing co-adaptation capacity), the random gate's perplexity drops toward the learned gate's level. The shaded area shows the gap shrinking from 31.5 (post-hoc, no co-adaptation) to 6.9 (29% of layers unfrozen). The learned gate remains flat at $\sim$10 ppl in all conditions; the improvement comes entirely from Q/K/V absorbing the routing signal.
  • Figure 4: Perplexity vs. sparsity for oracle, KL-distilled gate, BCE-distilled gate, and random masking at two scales. (a) On the 31M model, KL and BCE gates perform similarly, both tracking oracle closely. (b) On Qwen3-1.7B (log scale), KL distillation tracks oracle at all sparsity levels while BCE diverges catastrophically at high sparsity (102.8 vs. 12.2 at 87.5% sparsity). The gap reveals that ranking information, preserved by KL but discarded by BCE, is critical when attention is sharply peaked.
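
Two quantities in the captions above can be made concrete with a small sketch: the top-k attention mass reported in Figure 1, and the difference between the distillation targets compared in Figure 4. This is a hedged illustration rather than the paper's code. It assumes that BCE distillation targets the binary top-k oracle mask and that KL distillation targets the teacher's normalized attention distribution; the helper names and the 4-position toy rows are hypothetical (the paper's Figure 1 uses the top 64 of 512 positions).

```python
def topk_mass(attn_row, k):
    """Fraction of total attention mass on the k highest-weight positions."""
    total = sum(attn_row)
    return sum(sorted(attn_row, reverse=True)[:k]) / total

def bce_targets(attn_row, k):
    """Assumed BCE target: 1 for the top-k positions, 0 elsewhere."""
    order = sorted(range(len(attn_row)),
                   key=lambda i: attn_row[i], reverse=True)
    keep = set(order[:k])
    return [1 if i in keep else 0 for i in range(len(attn_row))]

def kl_targets(attn_row):
    """Assumed KL target: the normalized attention distribution itself."""
    total = sum(attn_row)
    return [a / total for a in attn_row]

# Two made-up attention rows with the same top-2 set but different shapes.
peaked = [0.90, 0.06, 0.03, 0.01]
flat = [0.30, 0.28, 0.22, 0.20]

print("top-2 mass, peaked:", topk_mass(peaked, 2))
print("top-2 mass, flat:  ", topk_mass(flat, 2))

# The binary targets are identical, so BCE cannot tell the rows apart.
print("BCE targets equal:", bce_targets(peaked, 2) == bce_targets(flat, 2))

# The soft targets differ, so KL retains the within-set ranking and magnitude.
print("KL targets equal: ", kl_targets(peaked) == kl_targets(flat))
```

Under these assumptions, both rows yield the same binary mask even though one places almost all of its mass on a single position. That is one way to read the Figure 4 caption's remark that ranking information is preserved by KL but discarded by BCE.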