Making Every Head Count: Sparse Attention Without the Speed-Performance Trade-off

Mingkuan Zhao, Wentao Hu, Jiayin Wang, Xin Lai, Tianchen Huang, Yuheng Min, Rui Yan, Xiaoyan Zhu

TL;DR

This work tackles the inefficiency of multi-head self-attention in Transformers by introducing SPAttention, a structured sparse attention mechanism that partitions the attention distance spectrum into balanced, non-overlapping bands assigned to each head. By enforcing completeness, exclusivity, and balance, SPAttention converts $H$ redundant $O(N^2)$ computations into a single $O(N^2)$ computation, achieving substantial throughput gains while preserving or improving performance on diverse benchmarks. The authors provide formal definitions, theoretical efficiency and regularization analyses, and extensive experiments showing roughly 2x training throughput and competitive or superior task performance across model scales up to 7B in the OLMoE framework, as well as favorable comparisons to leading sparse-attention methods. The results suggest that principled, hardware-friendly structural sparsity can break the traditional speed-performance trade-off and guide next-generation efficient LLM architectures.

Abstract

The design of Large Language Models (LLMs) has long been hampered by a fundamental conflict within their core attention mechanism: its remarkable expressivity is built upon a computational complexity of $O(HN^2)$ that grows quadratically with the context size ($N$) and linearly with the number of heads ($H$). This standard implementation harbors significant computational redundancy, as all heads independently compute attention over the same sequence space. Existing sparse methods, meanwhile, often trade information integrity for computational efficiency. To resolve this efficiency-performance trade-off, we propose SPAttention, whose core contribution is the introduction of a new paradigm we term Principled Structural Sparsity. SPAttention does not merely drop connections but instead reorganizes the computational task by partitioning the total attention workload into balanced, non-overlapping distance bands, assigning each head a unique segment. This approach transforms the multi-head attention mechanism from $H$ independent $O(N^2)$ computations into a single, collaborative $O(N^2)$ computation, fundamentally reducing complexity by a factor of $H$. The structured inductive bias compels functional specialization among heads, enabling a more efficient allocation of computational resources from redundant modeling to distinct dependencies across the entire sequence span. Our work demonstrates that thoughtfully designed structural sparsity can serve as an effective inductive bias that simultaneously improves both computational efficiency and model performance, opening a new avenue for the architectural design of next-generation, high-performance LLMs.
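
The factor-of-$H$ claim can be checked with a short count of query-key pairs. This is a sketch under assumptions: it counts only causal attention-score pairs, it writes $B_h$ (a symbol introduced here, not taken from the paper) for the set of distances assigned to head $h$, and it assumes the bands are exclusive and jointly cover every distance $d = 0, \dots, N-1$. It is not a prediction of wall-clock speedup.

$$
\underbrace{H \cdot \sum_{d=0}^{N-1} (N - d)}_{\text{dense: every head scores every pair}} = H \cdot \frac{N(N+1)}{2},
\qquad
\underbrace{\sum_{h=0}^{H-1} \sum_{d \in B_h} (N - d)}_{\text{banded: each distance scored once}} = \sum_{d=0}^{N-1} (N - d) = \frac{N(N+1)}{2}.
$$

For $N = 1024$ and $H = 8$, this gives $8 \times 524{,}800 = 4{,}198{,}400$ scored pairs for dense causal attention against $524{,}800$ for the banded partition, a ratio of exactly $H = 8$.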

Paper Structure

This paper contains 16 sections, 7 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: An illustration of the SPAttention sparse patterns for a sequence of length $N=1024$ with $H=8$ heads. Each subplot shows the attention pattern for an individual head. The entire causal attention distance spectrum is partitioned into eight contiguous, non-overlapping bands through Balanced Distance Partitioning, with each head assigned to exactly one band of width $\lfloor N/H \rfloor$ or $\lceil N/H \rceil$. This design guarantees complete, gapless information coverage while compelling different heads to specialize on distinct distance ranges—from immediate neighbors (Head 0) to long-range dependencies (Head 7).
  • Figure 2: Visualization of the sparse attention patterns for the three ablation variants ($H=8$, $N=1024$, showing one representative head for each). From left to right: (a) Sliding Window, representing classical sparse attention where all heads are restricted to a local window with size adapted to sequence length. (b) Exclusive Bands (EBALL), where local sharing is removed and each head is assigned a unique, non-overlapping distance band. (c) Gapped Bands (GBHALF), where systematic information blind spots are created by only attending to the first half of each head's assigned region.
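
The band patterns described in the two captions can be sketched as boolean masks. This is a minimal NumPy illustration, not the authors' implementation: the function names `band_edges` and `band_masks`, the `gapped` flag, the choice of which heads receive the wider $\lceil N/H \rceil$ band, and the rounding used for the "first half" of a band are all assumptions made here. It builds only the exclusive distance bands and does not model any local sharing across heads or the sliding-window variant, whose window size the captions do not specify.

```python
import numpy as np

def band_edges(n, h):
    """Split causal distances 0..n-1 into h contiguous bands.

    Head i covers distances edges[i] <= d < edges[i + 1].
    """
    base, extra = divmod(n, h)
    # Assumption: the first `extra` bands get the wider ceil(n/h) width.
    widths = np.array([base + 1 if i < extra else base for i in range(h)])
    return np.concatenate(([0], np.cumsum(widths)))

def band_masks(n, h, gapped=False):
    """Return masks of shape (h, n, n); True where head i may attend.

    With gapped=True, each head keeps only the first half of its band
    (floor division is an assumption), leaving blind spots.
    """
    edges = band_edges(n, h)
    lo, hi = edges[:-1], edges[1:]
    if gapped:
        hi = lo + (hi - lo) // 2
    # dist[q, k] = q - k; negative values are future keys and match no band.
    dist = np.arange(n)[:, None] - np.arange(n)[None, :]
    return (dist[None] >= lo[:, None, None]) & (dist[None] < hi[:, None, None])

n, h = 1024, 8
masks = band_masks(n, h)
causal = np.tril(np.ones((n, n), dtype=int))
# Completeness and exclusivity: every causal pair is covered by exactly one head.
assert np.array_equal(masks.sum(axis=0), causal)
```

The final assertion checks that the per-head masks tile the causal triangle with no overlap and no gap. Passing `gapped=True` produces masks that are a strict subset of the exclusive bands, which is the kind of blind spot the gapped ablation is described as introducing.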