PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention

Lida Chen; Dong Xu; Chenxin An; Xintao Wang; Yikai Zhang; Jiangjie Chen; Zujie Liang; Feng Wei; Jiaqing Liang; Yanghua Xiao; Wei Wang

TL;DR

The paper tackles the inefficiency of quadratic attention in long-context LLMs with PowerAttention, a theoretically grounded sparse attention design whose receptive field grows exponentially across layers while each token's out-degree stays at $O(\log n)$. Modeling attention as a directed acyclic graph (DAG) and analyzing reachability, the authors show that existing static and dynamic sparse patterns either leave gaps in coverage or expand their receptive field too slowly. PowerAttention connects each token to the tokens at power-of-two distances, so that $d$ layers completely cover every position within distance $2^d$. It outperforms static sparse patterns on long-range tasks (Passkey Retrieval, RULER) with competitive or better efficiency. Experiments covering perplexity, retrieval-based evaluation, and efficiency metrics support PowerAttention as a practical, scalable approach to very long sequences in LLMs. The work also includes a probing analysis of inter-layer information flow, which supports the design principle that targeted, exponentially expanding receptive fields can unlock long-context capabilities at modest computational cost.
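The power-of-two connection pattern described above can be sketched as a causal attention mask. This is a minimal illustration rather than the authors' implementation: it assumes each query attends only to itself and to the keys at distances 1, 2, 4, ... behind it, and the function names (`power_offsets`, `power_attention_mask`, `max_keys_per_query`) are hypothetical.

```python
def power_offsets(n: int) -> list[int]:
    """Distances 1, 2, 4, ... that are strictly smaller than the sequence length n."""
    offsets = []
    step = 1
    while step < n:
        offsets.append(step)
        step *= 2
    return offsets


def power_attention_mask(n: int) -> list[list[bool]]:
    """mask[i][j] is True when query i may attend to key j (causal, so j <= i).

    Assumption for this sketch: a query sees itself plus the keys at
    power-of-two distances behind it, and nothing else.
    """
    offsets = power_offsets(n)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        mask[i][i] = True  # every token attends to itself
        for off in offsets:
            if i - off >= 0:
                mask[i][i - off] = True
    return mask


def max_keys_per_query(mask: list[list[bool]]) -> int:
    """Largest number of keys any single query attends to (the out-degree)."""
    return max(sum(row) for row in mask)
```

With `n = 16` the last query (position 15) attends to positions 7, 11, 13, 14, and 15, so the out-degree is 5; with `n = 1024` it is 11. The count grows with $\log_2 n$ rather than with $n$, which is the $O(\log n)$ out-degree the summary refers to.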

Abstract

Large Language Models (LLMs) face efficiency bottlenecks due to the quadratic complexity of the attention mechanism when processing long contexts. Sparse attention methods offer a promising solution, but existing approaches often suffer from an incomplete effective context and/or require complex pipeline implementations. We present a comprehensive analysis of sparse attention for autoregressive LLMs from the perspective of the receptive field, identify the suboptimal way in which existing methods expand the receptive field, and introduce PowerAttention, a novel sparse attention design, grounded in this theoretical analysis, that enables effective and complete context extension. PowerAttention achieves exponential receptive field growth in $d$-layer LLMs, allowing each output token to attend to $2^d$ tokens while ensuring the completeness and continuity of the receptive field. Experiments demonstrate that PowerAttention outperforms existing static sparse attention methods by $5\sim 40\%$, especially on tasks demanding long-range dependencies such as Passkey Retrieval and RULER, while maintaining a time complexity comparable to sliding window attention. Efficiency evaluations further highlight PowerAttention's superior speedup in both the prefilling and decoding phases compared with dynamic sparse attention methods and full attention ($3.0\times$ faster on 128K context), making it a highly effective and user-friendly solution for processing long sequences in LLMs.
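The receptive-field claim can be checked with a small reachability computation. The sketch below is an illustration under stated assumptions, not the paper's code: each layer is treated as one hop in a graph where a token can read from itself and from tokens at a fixed set of backward distances, and the names `receptive_field` and `power_offsets_with_self` are hypothetical. A sliding window with a similar per-token budget is included only for contrast.

```python
def power_offsets_with_self(n: int) -> list[int]:
    """Backward distances 0 (self), 1, 2, 4, ... smaller than n."""
    offsets = [0]
    step = 1
    while step < n:
        offsets.append(step)
        step *= 2
    return offsets


def receptive_field(offsets: list[int], layers: int, query: int) -> set[int]:
    """Positions whose information can reach `query` after `layers` layers.

    Each layer lets a position read from the positions `offsets` behind it,
    so the receptive field is everything reachable in `layers` hops.
    Including offset 0 means shorter paths are kept as layers are added.
    """
    reachable = {query}
    for _ in range(layers):
        reachable = {
            pos - off
            for pos in reachable
            for off in offsets
            if pos - off >= 0  # stay inside the sequence (causal direction)
        }
    return reachable


# Contrast on a 64-token sequence, looking back from the last token.
n, query = 64, 63
power = receptive_field(power_offsets_with_self(n), layers=3, query=query)
window = receptive_field(list(range(5)), layers=3, query=query)  # window of 5
```

After 3 layers the power-of-two pattern covers every distance from 0 to 7 (that is, $2^3$ contiguous positions, since any distance below $2^d$ is a sum of at most $d$ powers of two), and after 6 layers it covers all 64 positions. The window of 5 reaches only distances 0 to 12 after 3 layers, growing linearly with depth.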

