PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention

Lida Chen, Dong Xu, Chenxin An, Xintao Wang, Yikai Zhang, Jiangjie Chen, Zujie Liang, Feng Wei, Jiaqing Liang, Yanghua Xiao, Wei Wang

TL;DR

The paper tackles the inefficiency of quadratic attention in long-context LLMs by developing a theoretically grounded sparse attention design, PowerAttention, that achieves exponential receptive-field growth across layers while keeping the per-token out-degree at $O(\log n)$. By modeling attention as a DAG and focusing on reachability, it shows that existing static and dynamic patterns fail to provide complete coverage or scalable growth. PowerAttention connects tokens at power-of-two distances, so that $d$ layers fully cover every token within distance $2^d$, and it outperforms static sparse patterns on long-range tasks (Passkey Retrieval, RULER) with competitive or superior efficiency. Experiments covering perplexity, retrieval-based evaluation, and efficiency metrics establish PowerAttention as a practical, scalable solution for processing ultra-long sequences in LLMs. The work also probes inter-layer information flow, supporting the design principle that targeted, exponentially expanding receptive fields can unlock long-context capabilities with modest computational overhead.

Abstract

Large Language Models (LLMs) face efficiency bottlenecks due to the quadratic complexity of the attention mechanism when processing long contexts. Sparse attention methods offer a promising solution, but existing approaches often suffer from incomplete effective context and/or require a complex implementation pipeline. We present a comprehensive analysis of sparse attention for autoregressive LLMs from the perspective of the receptive field, identify the suboptimal nature of existing methods for expanding the receptive field, and introduce PowerAttention, a novel sparse attention design, grounded in this theoretical analysis, that facilitates effective and complete context extension. PowerAttention achieves exponential receptive field growth in $d$-layer LLMs, allowing each output token to attend to $2^d$ tokens, ensuring completeness and continuity of the receptive field. Experiments demonstrate that PowerAttention outperforms existing static sparse attention methods by $5\sim 40\%$, especially on tasks demanding long-range dependencies like Passkey Retrieval and RULER, while maintaining a time complexity comparable to sliding window attention. Efficiency evaluations further highlight PowerAttention's superior speedup in both prefilling and decoding phases compared with dynamic sparse attention methods and full attention ($3.0\times$ faster on 128K context), making it a highly effective and user-friendly solution for processing long sequences in LLMs.
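
The receptive-field claim can be illustrated with a small simulation. The sketch below is not the authors' implementation: it assumes a token-level causal mask in which each token attends to itself and to the tokens at power-of-two distances behind it, and compares it with a sliding window of the same per-token budget. The helper names (`power_mask`, `sliding_window_mask`, `receptive_field`) and the sizes are hypothetical choices for illustration.

```python
def power_mask(n):
    """Causal mask: token i attends to itself and to tokens i - 2^k (assumed pattern)."""
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        mask[i][i] = True
        step = 1
        while step <= i:
            mask[i][i - step] = True
            step *= 2
    return mask


def sliding_window_mask(n, window):
    """Causal mask: token i attends to the `window` most recent tokens, itself included."""
    return [[0 <= i - j < window for j in range(n)] for i in range(n)]


def receptive_field(mask, query, layers):
    """Positions whose information can reach `query` after `layers` attention layers."""
    reached = {query}
    for _ in range(layers):
        # One more layer: everything attended to by an already reachable position.
        reached = {j for i in reached for j, allowed in enumerate(mask[i]) if allowed}
    return reached


n, query = 64, 63
power = power_mask(n)
# Same budget as the power pattern's last row (itself plus offsets 1, 2, 4, 8, 16, 32).
sliding = sliding_window_mask(n, window=sum(power[query]))

for d in range(1, 7):
    p = receptive_field(power, query, d)
    s = receptive_field(sliding, query, d)
    print(f"layers={d}  power={len(p):2d} tokens  sliding={len(s):2d} tokens")
```

With the same number of attended tokens per row, the sliding window gains a fixed number of positions per layer, whereas the power-of-two pattern reaches every offset below $2^d$ after $d$ layers in this toy setting.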

Paper Structure

This paper contains 27 sections, 1 theorem, 4 equations, 9 figures, 2 tables, 5 algorithms.

Key Result

Theorem 2.1

For a directed acyclic graph (DAG) with vertices labeled from 1 to $n$, let the edge set be $E = \{(i, j) \mid 1 \le i < j \le n,\ j - i = 2^k \text{ for some integer } k \ge 0\}$. Then the following properties hold:
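
The properties themselves are not shown in this excerpt. As a hedged numerical check, the sketch below assumes that edges join vertices whose labels differ by a power of two (as the TL;DR describes) and that edges point from the smaller label to the larger one. It verifies two consequences of that assumption: the $O(\log n)$ out-degree stated in the TL;DR, and a logarithmic bound on the number of hops between any two vertices, obtained by using one edge per set bit of the gap. The function names are hypothetical.

```python
def out_degree(i, n):
    """Number of edges leaving vertex i: powers of two 2^k with i + 2^k <= n."""
    return sum(1 for k in range(n.bit_length()) if i + (1 << k) <= n)


def power_path(i, j):
    """Walk from i to j (i < j) using one edge per set bit of j - i, largest jump first."""
    path, gap = [i], j - i
    while gap:
        jump = 1 << (gap.bit_length() - 1)  # largest power of two not exceeding the gap
        path.append(path[-1] + jump)
        gap -= jump
    return path


n = 1024
log_n = (n - 1).bit_length()  # ceil(log2(n)) computed without floating point

max_degree = max(out_degree(i, n) for i in range(1, n + 1))
longest_walk = max(len(power_path(1, j)) - 1 for j in range(2, n + 1))

print("max out-degree:", max_degree, "bound:", log_n)
print("longest walk from vertex 1:", longest_walk, "bound:", log_n)
print("example walk 3 -> 16:", power_path(3, 16))
```

The walk built by `power_path` is only an upper bound on the distance. It shows that the number of hops never exceeds the number of bits in the gap, which is what makes $d$ layers sufficient for gaps below $2^d$.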

Figures (9)

  • Figure 1: Layer-wise receptive field analysis of sparse attention patterns. (a) illustrates the information flow across six layers with a simplified 128-block example, while (b) presents the quantitative evaluation on Qwen2-7B with 32K context length. The actual token retrieval capability closely matches the theoretical receptive field growth for both patterns. Within the maximum information propagation depth, PowerAttention's exponential growth in receptive field leads to significantly higher accuracy compared to sliding window's linear expansion. Detailed implementation is provided in the appendix on retrieval evaluation.
  • Figure 2: (I) Modeling Attention Patterns as DAG: the attention mask serves as the adjacency matrix of a DAG, where edges represent token connections across layers, and the shortest path length indicates the minimum number of layers required for information flow between tokens. (II) Receptive Field Analysis for Sparse Attention Patterns: white lines show the shortest path to reach passkey tokens, with path length complexity $O(f(N))$ for distance $N$ and coverage indicating token accessibility.
  • Figure 3: Results on passkey retrieval with different attention patterns: (a) evaluation on context lengths up to 32k, and (b) comparison between stride slash attention and PowerAttention on extended context lengths up to 64k.
  • Figure 4: Efficiency evaluation results on Qwen2-7B with an NVIDIA A800 GPU.
  • Figure 5: Information flow probing results for various attention mechanisms on the 28-layer Qwen2-7B with 16K context length. The sequence is divided into 64 blocks (0 is at the beginning and 63 is at the end), each of which contains 256 tokens, as detailed in the appendix on probing. Each pixel represents the strength of passkey information at a specific layer and block position; a brighter pixel indicates a higher probability of extracting passkey information from that position. The classification accuracy of the final token block in the last layer is highlighted, as it directly determines the output token.
  • ...and 4 more figures
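
The DAG view in Figure 2 can be sketched as a breadth-first search in which the attention mask is the adjacency relation and the shortest path length is the minimum number of layers needed for information to flow between two tokens. The code below is a minimal sketch under that reading. The three patterns are illustrative assumptions, and `dilated` in particular is a hypothetical pattern included only to show what incomplete coverage looks like; it is not one of the paper's baselines.

```python
from collections import deque


def min_layers(allowed, source, target):
    """Fewest layers for information at `source` to reach `target` (source <= target).

    `allowed(q, k)` plays the role of the attention mask: True if query q may
    attend to key k. Returns None when `source` is outside the receptive field.
    """
    dist = {target: 0}
    queue = deque([target])
    while queue:
        q = queue.popleft()
        if q == source:
            return dist[q]
        for k in range(q + 1):  # causal: keys never lie ahead of the query
            if k not in dist and allowed(q, k):
                dist[k] = dist[q] + 1
                queue.append(k)
    return None


def sliding_window(window):
    return lambda q, k: 0 <= q - k < window


def power_of_two(q, k):
    gap = q - k
    return gap == 0 or (gap > 0 and gap & (gap - 1) == 0)


def dilated(stride, span):
    # Hypothetical pattern: only every `stride`-th token within `span` is visible.
    return lambda q, k: 0 <= q - k < span and (q - k) % stride == 0


last = 256
for distance in (16, 64, 255, 256):
    source = last - distance
    print(
        f"distance={distance:3d}",
        "window:", min_layers(sliding_window(8), source, last),
        "power:", min_layers(power_of_two, source, last),
        "dilated:", min_layers(dilated(2, 16), source, last),
    )
```

In this toy setting the sliding window needs a number of layers that grows linearly with the distance, the power-of-two pattern needs at most one layer per bit of the distance, and the dilated pattern never reaches tokens at odd distances, which is the kind of coverage gap the figure's "coverage" annotation refers to.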

Theorems & Definitions (2)

  • Theorem 2.1
  • proof