Stochastic Attention: Connectome-Inspired Randomized Routing for Expressive Linear-Time Attention

Zehao Jin, Yanan Sui

Abstract

The whole-brain connectome of a fruit fly comprises over 130K neurons connected with a probability of merely 0.02%, yet achieves an average shortest path of only 4.4 hops. Despite being highly structured at the circuit level, the network's long-range connections are broadly distributed across brain regions, functioning as stochastic shortcuts that enable efficient global communication. Inspired by this observation, we propose Stochastic Attention (SA), a drop-in enhancement for sliding-window attention (SWA) that applies a random permutation to the token sequence before windowed attention and restores the original order afterward. This transforms the fixed local window into a stochastic global one within the same $O(nw)$ per-layer budget. Through depth, independently sampled permutations yield exponentially growing receptive fields, achieving full sequence coverage in $O(\log_w n)$ layers versus $O(n/w)$ for SWA. We validate SA in two settings: pre-training language models from scratch, where a gated SA + SWA combination achieves the best average zero-shot accuracy, and training-free inference on Qwen3-8B and Qwen3-30B-A3B, where SA consistently outperforms SWA and matches or exceeds Mixture of Block Attention at comparable compute budgets. These results suggest that connectome-inspired stochastic routing is a practical primitive for improving the expressivity of efficient attention, complementary to existing linear and sparse approaches.
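
A minimal single-head sketch of this permute-attend-restore step (not the authors' implementation): `window_attention` below materializes a dense $n \times n$ mask purely for readability, whereas an efficient kernel would keep the stated $O(nw)$ budget, and `gated_sa_swa` illustrates the gated SA + SWA combination with an assumed scalar gate.

```python
# Minimal sketch of Stochastic Attention (SA): permute the token sequence,
# run causal sliding-window attention, then restore the original order.
# Not the authors' implementation; the dense n x n mask below is for clarity only,
# whereas an efficient windowed kernel keeps the O(n*w) budget.
import torch

def window_attention(q, k, v, w):
    """Causal sliding-window attention over (n, d) tensors (dense, for clarity)."""
    n, d = q.shape
    scores = (q @ k.T) / d ** 0.5
    dist = torch.arange(n)[:, None] - torch.arange(n)[None, :]   # i - j
    scores = scores.masked_fill((dist < 0) | (dist >= w), float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

def stochastic_attention(q, k, v, w, generator=None):
    """SA layer: random permutation -> windowed attention -> inverse permutation."""
    n = q.shape[0]
    perm = torch.randperm(n, generator=generator)     # fresh sigma, resampled per layer
    inv = torch.empty_like(perm)
    inv[perm] = torch.arange(n)                       # sigma^{-1}
    out = window_attention(q[perm], k[perm], v[perm], w)
    return out[inv]                                   # back to the original token order

def gated_sa_swa(q, k, v, w, gate_logit):
    """Gated SA + SWA blend; the scalar learned gate is an assumed simplification."""
    g = torch.sigmoid(gate_logit)
    return g * stochastic_attention(q, k, v, w) + (1 - g) * window_attention(q, k, v, w)

# Toy usage: 16 tokens, 8 channels, window of 4.
q, k, v = (torch.randn(16, 8) for _ in range(3))
print(gated_sa_swa(q, k, v, w=4, gate_logit=torch.tensor(0.0)).shape)  # torch.Size([16, 8])
```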

Paper Structure

This paper contains 26 sections, 3 theorems, 19 equations, 8 figures, 5 tables, and 2 algorithms.

Key Result

Proposition 1

For a uniform random permutation $\sigma \sim \mathrm{Uniform}(\mathcal{S}_n)$ and any fixed pair $(i,j)$ with $i \neq j$, the probability that $j$ lands inside the size-$w$ attention window of $i$ after permutation is $\Theta(w/n)$.
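
One way to see the $\Theta(w/n)$ scaling, sketched here under the symmetric window convention $|\sigma(i)-\sigma(j)| \le w$ (a causal window changes only the constant factor): since $(\sigma(i),\sigma(j))$ is uniform over the $n(n-1)$ ordered pairs of distinct positions and exactly $2(n-d)$ of them lie at distance $d$,

$$\Pr\bigl[\,|\sigma(i)-\sigma(j)| \le w\,\bigr] = \frac{\sum_{d=1}^{w} 2(n-d)}{n(n-1)} = \frac{w(2n-w-1)}{n(n-1)} = \Theta\!\left(\frac{w}{n}\right).$$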

Figures (8)

  • Figure 1: Overview of Stochastic Attention (SA). (a) A standard SWA Transformer layer. (b) The fruit fly whole-brain connectome: the adjacency matrix, shown after Reverse Cuthill–McKee reordering to expose block structure, lacks clear diagonal blocks, indicating that connections are broadly distributed across brain regions rather than confined to local modules. (c) An SA layer: token sequences are randomly permuted before windowed attention and restored afterward, producing stochastic long-range shortcuts analogous to the cross-regional connections in (b).
  • Figure 2: Left: Receptive field coverage as a function of depth ($n{=}2048$, $w{=}32$). SA achieves full sequence coverage in $O(\log_w n)$ layers via exponential growth, while SWA requires $O(n/w)$ layers with linear growth. Right: Computational cost scaling with sequence length ($w{=}256$). Both SA and SWA maintain $O(nw)$ linear scaling, while full attention grows quadratically. A small simulation sketch of the coverage trend appears after this figure list.
  • Figure 3: Attention weight visualization (Layer 11, Head 0) on a 27-token sequence with window size $w{=}8$. Gray regions are masked (structurally invisible). Blue intensity indicates attention weight. Full Attention exhibits the complete lower-triangular pattern. SWA shows a strict diagonal band with all out-of-window positions masked. Stochastic Attention introduces scattered non-zero entries beyond the diagonal band. These are distant tokens that became local neighbors after random permutation, enabling direct long-range information flow within the same $O(nw)$ budget. SA + SWA combines both patterns: the SWA path provides the dense diagonal band for local coherence, while the SA path adds stochastic long-range connections, with the learned gate adaptively balancing the two.
  • Figure 4: Average accuracy across 7 benchmarks as a function of effective window size for Qwen3-8B (left) and Qwen3-30B-A3B (right). Stochastic Attention (red) recovers the full-attention baseline (dashed gray) most rapidly as window size increases, consistently outpacing SWA (blue) and matching or exceeding MoBA (green) at comparable compute budgets.
  • Figure 5: Per-task accuracy vs. window size on Qwen3-8B for four representative benchmarks.
  • ...and 3 more figures
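
To make the coverage trend in Figure 2 (left) concrete, here is a small simulation under assumed, scaled-down settings ($n{=}1024$, $w{=}32$, causal windows applied in permuted order for SA); it tracks, layer by layer, the fraction of source tokens each query position has been able to reach.

```python
# Qualitative check of the Figure 2 (left) trend: receptive-field coverage vs. depth
# for SA and SWA. Settings are scaled down from the figure so the boolean-matrix
# simulation stays fast; the shape of the curves, not the exact values, is the point.
import numpy as np

def coverage_curve(n=1024, w=32, depth=8, mode="sa", seed=0):
    rng = np.random.default_rng(seed)
    reach = np.eye(n, dtype=bool)            # reach[i, j]: query i has "seen" token j
    fracs = []
    for _ in range(depth):
        perm = rng.permutation(n) if mode == "sa" else np.arange(n)
        r = reach[perm]                      # rows in (possibly permuted) order
        new = r.copy()
        for d in range(1, w):                # causal window: position p also sees p-d
            new[d:] |= r[:-d]
        reach = new[np.argsort(perm)]        # restore original token order
        fracs.append(reach.mean())           # average receptive-field coverage
    return fracs

print("SA :", [f"{c:.2f}" for c in coverage_curve(mode="sa")])
print("SWA:", [f"{c:.2f}" for c in coverage_curve(mode="swa")])
# SA saturates within a few layers (on the order of log_w n), while SWA's
# coverage grows only linearly with depth.
```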

Theorems & Definitions (7)

  • Proposition 1
  • Proof
  • Proposition 2
  • Proof
  • Proposition 3
  • Proof
  • Proof