When Does Content-Based Routing Work? Representation Requirements for Selective Attention in Hybrid Sequence Models

Abhinaba Basu

Abstract

We identify a routing paradox in hybrid recurrent-attention architectures: content-based routing (deciding which tokens deserve expensive attention) requires exactly the pairwise computation that routing is designed to avoid. Through more than 20 controlled experiments across three tasks (a synthetic diagnostic, the Zoology MQAR benchmark, and HotpotQA), we map the routing landscape exhaustively. One layer of softmax attention creates a latent ~34-dimensional subspace that enables 98.4% routing precision; zero layers yield 1.2%. This subspace is invisible to cosine similarity, is destroyed by random projections (98.4% to 2.6%), and cannot be created by contrastive pretraining, proving that attention's role is to write pairwise match results into representations, not merely to compute them. Twelve alternative mechanisms all cluster at 15-29%. Non-learned indices (Bloom filter: 90.9%; BM25 on HotpotQA: 82.7%) bypass the bottleneck entirely. The result is a sharp two-regime hierarchy with an empty middle ground. These findings provide a mechanistic explanation for the empirical observation that recurrent models fail at associative recall, and they reframe attention as a representation constructor rather than merely a computation mechanism.

Paper Structure

This paper contains 40 sections, 1 equation, 6 figures, 2 tables.

Figures (6)

  • Figure 1: The routing paradox. (a) Hybrid architectures want cheap recurrence for most tokens and expensive attention for a few. A router decides which tokens get attention. (b) The paradox: the router needs attention-quality representations to identify which tokens deserve attention, but generating those representations is itself the expensive computation. (c) Our experiments reveal two working regimes (non-learned indices and softmax attention) with a large gap between them.
  • Figure 2: Phase transition at one Transformer layer. (a) Routing precision jumps 82$\times$ from 1.2% to 98.4% between 0 and 1 layers; additional layers provide no gain. (b) Over training, the transition occurs in a single epoch (epoch 10), a discrete regime change rather than gradual improvement.
  • Figure 3: The routing signal lives in a latent subspace. (a) Cosine similarity between query and answer representations is negative in the successful condition (1L Transformer): matching tokens are not geometrically close. (b) Replacing learned routing projections with random matrices drops routing from 98.4% to 2.6%, confirming the signal requires specific learned access. (c) SVD of the combined routing matrix $W_q W_k^\top$ shows 90% of energy in just 34 of 128 dimensions.
  • Figure 4: The routing landscape. Twenty approaches tested across non-learned indices, learned segment routing, contextual bandits, and contrastive pretraining. Two regimes work; everything else clusters at 1-29%.
  • Figure 5: Why contrastive pretraining fails. (a) Attention has three steps; contrastive loss replicates only step 1. (b) Step 3 (value aggregation) is irreplaceable: it writes match results into representations. (c) Contrastive pretraining achieves only 1.6-2.2%, no improvement over the 1.2% baseline.
  • ...and 1 more figures
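
The Figure 3(c) caption reports that 90% of the energy of $W_q W_k^\top$ sits in 34 of 128 dimensions. As a rough illustration of how such a count could be obtained, the sketch below finds the smallest number of singular values whose cumulative energy reaches a threshold. It assumes that "energy" means the sum of squared singular values; the paper's exact definition is not given in this summary. The function name `energy_rank` and the toy spectrum are hypothetical, not taken from the paper.

```python
def energy_rank(singular_values, threshold=0.90):
    """Smallest k such that the k largest singular values hold
    at least `threshold` of the total energy."""
    if not 0.0 < threshold <= 1.0:
        raise ValueError("threshold must be in (0, 1]")
    # Assumption: energy is the sum of squared singular values.
    energies = sorted((s * s for s in singular_values), reverse=True)
    total = sum(energies)
    if total == 0.0:
        return 0  # an all-zero matrix has no energy to account for
    running = 0.0
    for k, energy in enumerate(energies, start=1):
        running += energy
        if running >= threshold * total:
            return k
    return len(energies)


# Hypothetical spectrum, not data from the paper:
# 4 strong directions followed by 12 weak ones.
toy_spectrum = [10.0] * 4 + [1.0] * 12
print(energy_rank(toy_spectrum))  # prints 4
```

In practice the singular values would come from a numerical routine such as `numpy.linalg.svd(M, compute_uv=False)` applied to the learned matrix. Under the squared-energy assumption, a result of 34 for a 128-dimensional matrix would correspond to the figure quoted in the caption.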