Scaling Linear Attention with Sparse State Expansion

Yuqi Pan, Yongqi An, Zheng Li, Yuhong Chou, Ruijie Zhu, Xiaohui Wang, Mingxuan Wang, Jinqiao Wang, Guoqi Li

TL;DR

This work tackles the inefficiency of Transformer long-context processing by introducing a row-sparse state update mechanism and Sparse State Expansion (SSE) to extend state capacity without inflating parameters. By interpreting state updates as information classification and employing top-$k$ softmax-based selection, the model achieves more discriminative, sparse representations and larger receptive fields. SSE further expands state capacity into partitioned slots with shared parameters and write-read gating, enabling scalable long-context modeling. Evaluations across language modeling, in-context retrieval, and mathematical reasoning show SSE and its hybrid SSE-H variant outperform existing linear attention models and match or surpass Transformer baselines on key tasks, with notable gains in AIME-style reasoning at moderate scales. These results position SSE as a practical, efficient architecture for high-fidelity long-context modeling and reasoning.

Abstract

The Transformer architecture, despite its widespread success, struggles with long-context scenarios due to quadratic computation and linear memory growth. While various linear attention variants mitigate these efficiency constraints by compressing context into fixed-size states, they often degrade performance in tasks such as in-context retrieval and reasoning. To address this limitation and achieve more effective context compression, we propose two key innovations. First, we introduce a row-sparse update formulation for linear attention by conceptualizing state updating as information classification. This enables sparse state updates via softmax-based top-$k$ hard classification, thereby extending receptive fields and reducing inter-class interference. Second, we present Sparse State Expansion (SSE) within the sparse framework, which expands the contextual state into multiple partitions, effectively decoupling parameter size from state capacity while maintaining the sparse classification paradigm. Supported by efficient parallelized implementations, our design achieves effective classification and highly discriminative state representations. We extensively validate SSE in both pure linear and hybrid (SSE-H) architectures across language modeling, in-context retrieval, and mathematical reasoning benchmarks. SSE demonstrates strong retrieval performance and scales favorably with state size. Moreover, after reinforcement learning (RL) training, our 2B SSE-H model achieves state-of-the-art mathematical reasoning performance among small reasoning models, scoring 64.5 on AIME24 and 50.2 on AIME25, significantly outperforming similarly sized open-source Transformers. These results highlight SSE as a promising and efficient architecture for long-context modeling.
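The row-sparse update described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes the simplest additive linear-attention recurrence (each state row accumulates a weighted copy of the value vector), treats the softmax over key logits as the "classifier" over state rows, and zeroes all but the top-$k$ probabilities without renormalizing. The function names (`topk_softmax_key`, `row_sparse_update`, `read`) and the absence of decay gates are assumptions made for illustration only.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def topk_softmax_key(key_logits, k):
    # Softmax over state rows, then keep only the k most probable rows.
    # Assumption: the surviving probabilities are not renormalized.
    probs = softmax(key_logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    rows = sorted(order[:k])
    return rows, [probs[i] for i in rows]

def row_sparse_update(state, key_logits, value, k):
    # state: list of d_k rows, each a list of d_v floats (updated in place).
    # Only the k selected rows are written; all other rows stay untouched.
    rows, weights = topk_softmax_key(key_logits, k)
    for i, w in zip(rows, weights):
        for j, v in enumerate(value):
            state[i][j] += w * v
    return rows

def read(state, query):
    # Linear-attention style read: output = query (1 x d_k) times state (d_k x d_v).
    d_v = len(state[0])
    return [sum(q * row[j] for q, row in zip(query, state)) for j in range(d_v)]

# Toy example: 4 state rows, value dimension 2, top-2 row selection.
state = [[0.0, 0.0] for _ in range(4)]
touched = row_sparse_update(state, [2.0, 0.1, -1.0, 1.5], [1.0, -1.0], k=2)
print(touched)  # [0, 3]
print(read(state, [1.0, 0.0, 0.0, 1.0]))
```

Because each token writes to only $k$ rows, tokens routed to different rows do not overwrite one another, which is the intuition behind the reduced inter-class interference claimed above.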

Paper Structure

This paper contains 19 sections, 4 theorems, 26 equations, 10 figures, 11 tables, and 2 algorithms.

Key Result

Proposition 1

For inputs $\mathbf{x}_s$ satisfying $\|\mathbf{x}_s\|^2=d$, and given the classification rule $C^i=\{\mathbf{x}_s \mid \mathbf{x}_s \mathbf{W}_k^i > k_{th}^i\}$, inputs belonging to the same class (row) $C^i$ satisfy $\mathbf{x}_r\mathbf{x}_s^{\top}>d\cos{2\theta^i}$, where $\theta^i=\arccos(\frac{k_{th

Figures (10)

  • Figure 1: Left: Benchmark performance comparison in mathematical reasoning. Right: State scaling performance. $n$ represents the number of expanded state partitions, and $k$ denotes the top-$k$ selection size.
  • Figure 2: Clustering of information within linear attention state rows. We observe that learned state representations reveal clear clustering patterns. Specifically, we assign each token's value vector (represented as a point) to a specific state row (indicated by color) by taking the maximum activation over its corresponding key vector. This assignment demonstrates that information within the same row tends to share similar feature representations.
  • Figure 3: Row-wise cosine similarity of contextual states in linear attention models with varying classifier designs. The figure presents the $128 \times 128$ similarity matrices between state rows, where darker blue indicates lower similarity, reflecting more effective classification.
  • Figure 4: Comparison between vanilla linear attention and SSE. SSE expands the state into $N$ partitions within the row-sparse update framework, where a classification function assigns information to specific state rows. All partitions share attention parameters. Sparse row selection follows two steps: (1) top-$k$ partition selection based on a write-read gate (blue indicates selected partitions; green marks an always-selected partition for training stability), and (2) row selection within the chosen partitions via softmax over key vectors.
  • Figure 5: Singular value entropy of the contextual states. SSE with diagonal gating exhibits higher singular value entropy than GLA, indicating a less compressible and more effectively utilized state composition.
  • ...and 5 more figures
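
The two-step selection in the Figure 4 caption can be sketched as follows. This is a rough illustration under stated assumptions, not the paper's algorithm: the always-selected partition is taken to be index 0 and is not counted toward $k$; the row selection inside a chosen partition uses a plain softmax over key logits; writes are not scaled by the gate value; and the same chosen partitions are used for reading. All function names are hypothetical.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def select_partitions(gate_logits, k, always_on=0):
    # Step 1: pick the top-k partitions by gate score, plus one partition
    # that is always selected (assumed to be index 0 in this sketch).
    others = [p for p in range(len(gate_logits)) if p != always_on]
    others.sort(key=lambda p: gate_logits[p], reverse=True)
    return [always_on] + sorted(others[:k])

def sse_write(states, gate_logits, key_logits, value, k):
    # states: N partitions, each a d_k x d_v list of lists (updated in place).
    # The key weights come from one shared projection, so adding partitions
    # grows the state without adding attention parameters.
    chosen = select_partitions(gate_logits, k)
    key = softmax(key_logits)  # Step 2: row selection inside each partition
    for p in chosen:
        for i, w in enumerate(key):
            for j, v in enumerate(value):
                states[p][i][j] += w * v
    return chosen

def sse_read(states, chosen, query):
    # Read only from the partitions chosen by the shared write-read gate.
    d_v = len(states[0][0])
    out = [0.0] * d_v
    for p in chosen:
        for q, row in zip(query, states[p]):
            for j in range(d_v):
                out[j] += q * row[j]
    return out

# Toy example: N = 4 partitions, d_k = 3 rows each, d_v = 2, top-1 gating.
states = [[[0.0, 0.0] for _ in range(3)] for _ in range(4)]
chosen = sse_write(states, [0.0, 2.0, -1.0, 0.5], [1.0, 0.0, -1.0], [1.0, 2.0], k=1)
print(chosen)  # [0, 1]
print(sse_read(states, chosen, [1.0, 1.0, 1.0]))
```

In this sketch the total state size is $N \times d_k \times d_v$, while the key, query, and value projections that would produce `key_logits`, `query`, and `value` are shared across partitions, which mirrors the decoupling of parameter size from state capacity described in the abstract.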

Theorems & Definitions (12)

  • Proposition 1
  • Proof 1
  • Definition 1
  • Proposition 2
  • Proof 2
  • Definition 2
  • Proposition 3
  • Proof 3
  • Definition 3
  • Definition 4
  • ...and 2 more