The Key to State Reduction in Linear Attention: A Rank-based Perspective

Philipp Nazari, T. Konstantin Rusch

TL;DR

This work analyzes why linear attention’s associative memory often operates at low effective rank and how this degrades retrieval under noise. It develops a rank-centric theory linking effective rank, rank utilization, and retrieval error, and then introduces a hardware-aware, axis-aligned pruning framework (including the DRRQR method) to reduce state size while preserving compatibility with existing depthwise convolutions and CUDA kernels. Empirically, pruning can remove about 50% of key and query channels with only modest perplexity increases and notable throughput gains, though recall-heavy tasks may still suffer without hybridizing with softmax attention. The results provide a principled path toward faster, memory-efficient linear-attention models and offer design guidance for future hybrid architectures that balance efficiency with retrieval performance.

Abstract

Linear attention offers a computationally efficient yet expressive alternative to softmax attention. However, recent empirical results indicate that the state of trained linear attention models often exhibits a low-rank structure, suggesting that these models underexploit their capacity in practice. To illuminate this phenomenon, we provide a theoretical analysis of the role of rank in linear attention, revealing that low effective rank can affect retrieval error by amplifying query noise. In addition to these theoretical insights, we conjecture that the low-rank states can be substantially reduced post-training with only minimal performance degradation, yielding faster and more memory-efficient models. To this end, we propose a novel hardware-aware approach that structurally prunes key and query matrices, reducing the state size while retaining compatibility with existing CUDA kernels. We adapt several existing pruning strategies to fit our framework and, building on our theoretical analysis, propose a novel structured pruning method based on a rank-revealing QR decomposition. Our empirical results, evaluated across models of varying sizes and on various downstream tasks, demonstrate the effectiveness of our state reduction framework. We highlight that our framework enables the removal of 50% of the query and key channels at only a marginal increase in perplexity. The code for this project can be found at https://github.com/camail-official/LinearAttentionPruning.
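
To make the idea of structured key and query pruning concrete, here is a minimal pure-Python sketch. It is not the paper's DRRQR method or its CUDA path. It assumes the plain additive recurrence $\mathbf{S}_t = \mathbf{S}_{t-1} + \mathbf{v}_t\mathbf{k}_t^\top$ with an unnormalized readout $\mathbf{S}_t\mathbf{q}_t$, and it ranks key channels with a generic column-pivoted QR applied to a toy key matrix. The function names `pivoted_qr_select`, `linear_attention`, and `prune_channels`, and all data, are hypothetical.

```python
# Illustrative sketch only: generic column-pivoted QR on toy data,
# NOT the paper's DRRQR method. All names and values are hypothetical.

def pivoted_qr_select(K, m):
    """Pick m columns (channels) of K (T x d_k) by greedy column-pivoted QR."""
    T, d = len(K), len(K[0])
    cols = [[K[t][j] for t in range(T)] for j in range(d)]  # residual columns
    chosen = []
    for _ in range(m):
        # Squared residual norm of each not-yet-chosen column.
        norms = [-1.0 if j in chosen else sum(x * x for x in c)
                 for j, c in enumerate(cols)]
        p = max(range(d), key=lambda j: norms[j])
        chosen.append(p)
        nrm = norms[p] ** 0.5
        if nrm == 0.0:
            continue  # remaining columns are already fully explained
        q = [x / nrm for x in cols[p]]
        # Remove the pivot direction from every remaining column.
        for j in range(d):
            if j not in chosen:
                proj = sum(a * b for a, b in zip(q, cols[j]))
                cols[j] = [a - proj * b for a, b in zip(cols[j], q)]
    return sorted(chosen)

def linear_attention(Q, K, V):
    """Causal readout o_t = S_t q_t with S_t = S_{t-1} + v_t k_t^T."""
    d_v, d_k = len(V[0]), len(K[0])
    S = [[0.0] * d_k for _ in range(d_v)]
    outs = []
    for q, k, v in zip(Q, K, V):
        for i in range(d_v):
            for j in range(d_k):
                S[i][j] += v[i] * k[j]
        outs.append([sum(S[i][j] * q[j] for j in range(d_k)) for i in range(d_v)])
    return outs, S

def prune_channels(X, keep):
    """Keep only the listed channels (columns) of X."""
    return [[row[j] for j in keep] for row in X]

# Toy keys whose channels 1 and 3 carry no signal (an idealized low-rank case).
K = [[3.0, 0.0, 2.0, 0.0], [0.5, 0.0, -1.0, 0.0], [2.0, 0.0, 0.5, 0.0]]
Q = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0], [1.0, 1.0, 1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

keep = pivoted_qr_select(K, 2)  # retain 50% of the key/query channels
out_full, S_full = linear_attention(Q, K, V)
out_pruned, S_pruned = linear_attention(
    prune_channels(Q, keep), prune_channels(K, keep), V)

print("kept channels:", keep)
print("state shape:", (len(S_full), len(S_full[0])),
      "->", (len(S_pruned), len(S_pruned[0])))
```

In this idealized case the dropped key channels are exactly zero, so the pruned model reproduces the full outputs while storing a state half as wide. In a trained model the dropped channels are only approximately redundant, which is why pruning can incur a perplexity cost.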

Paper Structure

This paper contains 60 sections, 14 theorems, 74 equations, 6 figures, 16 tables, and 3 algorithms.

Key Result

Proposition 2.2

Consider the linear attention recurrence $\mathbf{S}_t = \mathbf{S}_{t-1} + \mathbf{v}_t\mathbf{k}_t^\top$. There exists a scalar quantity $\nu(\mathbf{V}_t)$ such that the effective rank of the memory is lower bounded:

Figures (6)

  • Figure 1: Left: Unstructured pruning yields a sparse weight matrix $\mathbf{W}_{\mathbf{K}}$ yet preserves the column dimension of $\mathbf{K}$, leaving the shape of the state matrix $\mathbf{S}_t \in \mathbb{R}^{d_v \times d_k}$ unchanged. Right: Structured pruning eliminates basis vectors, mapping keys to a lower-dimensional space $\mathbb{R}^{d'_k}$ where $d'_k < d_k$. This results in a compressed state $\mathbf{S}_t \in \mathbb{R}^{d_v \times d'_k}$ and strictly decreases the FLOP count required to compute the recurrence. This figure is inspired by SliceGPT (Ashkboos et al., 2024).
  • Figure 2: Rank utilization of DeltaNet 370M as a function of the token index for a random FineWeb-Edu sample of length $1024$, averaged over layers and heads, at a compression ratio of 75% before recovery fine-tuning (pre-RFT). Note how the strongest models exhibit the largest rank utilization. All models except for the baseline have the same maximum capacity and are thus directly comparable.
  • Figure 3: Singular value distribution of a DeltaNet 370M layer computed on a random FineWeb-Edu sample ($T=1024$). We compare the uncompressed Baseline against DRRQR and Grad at 75% compression (pre-RFT).
  • Figure 4: Impact of Recovery Fine-Tuning (RFT) on the singular value spectrum. We compare the uncompressed Baseline against DRRQR at 75% compression, both before (pre) and after (post) RFT.
  • Figure 5: Convolution filters for queries, keys, and values of a standard DeltaNet 370M (non-shared). High similarity within heads (separated by red lines) suggests implicit sharing.
  • ...and 1 more figure
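
As a rough back-of-the-envelope count (an illustrative assumption, not a figure from the paper), suppose each token performs one state update $\mathbf{S}_t = \mathbf{S}_{t-1} + \mathbf{v}_t\mathbf{k}_t^\top$ and one readout $\mathbf{o}_t = \mathbf{S}_t\mathbf{q}_t$ per head. The readout form and the symbol $\mathbf{o}_t$ are assumed here. The FLOP reduction described in the Figure 1 caption then works out as follows:

$$
\underbrace{d_v d_k}_{\text{update}} + \underbrace{d_v d_k}_{\text{readout}} = 2\, d_v d_k \ \text{multiply-adds}
\quad\longrightarrow\quad
2\, d_v d'_k,
\qquad
\frac{2\, d_v d'_k}{2\, d_v d_k} = \frac{d'_k}{d_k}.
$$

With $d'_k = d_k / 2$, the per-token recurrence cost and the state memory ($d_v d_k$ entries versus $d_v d'_k$) both halve under this count. The estimate ignores projections, convolutions, and kernel-level effects such as chunking, so measured throughput gains need not match this ratio.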

Theorems & Definitions (22)

  • Definition 2.1: Effective Rank [ipsen2025stable]
  • Proposition 2.2
  • Definition 2.3: Rank Utilization
  • Corollary 2.4
  • Theorem 2.5: Effective Rank Governs Retrieval Error
  • Corollary 2.6: Expected Error Bounds
  • Proposition 3.1: Orthogonal Invariance of Sequence Mixing
  • Definition 3.2: Axis-Aligned Transformations
  • Proposition 3.3: Compatibility with Depthwise Convolutions
  • Proposition 2.1: Optimal Diagonal Adaptation
  • ...and 12 more
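
The statements of these results are not reproduced on this page, so the following is only a toy illustration of why axis-aligned (channel-selecting) pruning can coexist with depthwise convolutions, as the TL;DR claims. It is not the paper's proof of Proposition 3.3. A depthwise convolution filters each channel independently, so dropping channels commutes with it, whereas a generic rotation that mixes channels does not. The helpers `depthwise_causal_conv`, `select`, and `mix`, and all numbers, are hypothetical.

```python
import math

# Toy illustration; helper names and data are hypothetical.

def depthwise_causal_conv(X, filters):
    """X: T x d sequence; filters[c] is the 1-D causal filter of channel c.
    y[t][c] = sum_i filters[c][i] * X[t - i][c], zero-padded for t - i < 0."""
    T, d = len(X), len(X[0])
    return [[sum(f * X[t - i][c] for i, f in enumerate(filters[c]) if t - i >= 0)
             for c in range(d)] for t in range(T)]

def select(X, keep):
    """Axis-aligned pruning: keep only the listed channels."""
    return [[row[c] for c in keep] for row in X]

def mix(X, M):
    """Apply a channel-mixing matrix M to every time step."""
    return [[sum(M[r][c] * row[c] for c in range(len(row)))
             for r in range(len(M))] for row in X]

X = [[1.0, 2.0, 3.0], [0.0, -1.0, 4.0], [2.0, 0.5, -2.0]]
filters = [[1.0, 0.5], [2.0, -1.0], [0.5, 0.25]]
keep = [0, 2]

# Channel selection commutes with the depthwise convolution
# (the filters of dropped channels are simply discarded).
conv_then_select = select(depthwise_causal_conv(X, filters), keep)
select_then_conv = depthwise_causal_conv(select(X, keep),
                                         [filters[c] for c in keep])

# A 45-degree rotation mixing channels 0 and 1 does not commute in general,
# because those two channels use different filters.
c = s = math.sqrt(0.5)
M = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
conv_then_mix = mix(depthwise_causal_conv(X, filters), M)
mix_then_conv = depthwise_causal_conv(mix(X, M), filters)

print("selection commutes:", conv_then_select == select_then_conv)
print("rotation commutes:", conv_then_mix == mix_then_conv)
```

The design point is that a pruned model can reuse the original per-channel filters unchanged, while a rotated basis would require a dense (non-depthwise) convolution to reproduce the same computation.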