$\pi$-Attention: Periodic Sparse Transformers for Efficient Long-Context Modeling

Dong Liu, Yanxuan Yu

TL;DR

$\pi$-Attention addresses the $O(n^2)$ cost of standard self-attention by combining ring-local attention with deterministic $\pi$-stride skips and a dynamic fusion gate that blends local and long-range context. The method achieves a receptive-field bound of $R(L) \le kL + \pi \lceil \log_2 L \rceil$, compared with $\mathcal{O}(kL)$ for RingAttention, while keeping per-layer complexity at $O(nk)$ and memory at $O(nk + n)$. Empirically, it matches or surpasses dense attention across language, retrieval, and vision-language tasks, delivering an $8.3\%$ perplexity improvement on WikiText-103 and $24.1\%$ fewer FLOPs while using fewer GPUs. Ablations show that both periodic skips and adaptive fusion are necessary, with favorable throughput–quality trade-offs at context lengths up to 32k tokens. These results demonstrate a practical, scalable pathway for long-context modeling with transformers.
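To make the receptive-field bound concrete, the sketch below evaluates $kL + \pi \lceil \log_2 L \rceil$ next to the local-only growth $kL$. It reads $L$ as the number of layers, following Proposition 1. The function names and the sample values of $k$ and $\pi$ are illustrative assumptions, not settings reported in the paper.

```python
def ceil_log2(x: int) -> int:
    """Exact integer ceil(log2(x)) for x >= 1, avoiding float rounding."""
    if x < 1:
        raise ValueError("x must be >= 1")
    return (x - 1).bit_length()


def ring_bound(k: int, layers: int) -> int:
    """Local-only growth: at most k new tokens reached per layer."""
    return k * layers


def pi_attention_bound(k: int, layers: int, pi: int) -> int:
    """Upper bound quoted in the TL;DR: k*L + pi*ceil(log2(L))."""
    return k * layers + pi * ceil_log2(layers)


if __name__ == "__main__":
    k, pi = 64, 128  # hypothetical window size and skip period
    for layers in (1, 4, 12, 24):
        print(layers, ring_bound(k, layers), pi_attention_bound(k, layers, pi))
```

Under this bound the extra reach contributed by the skips grows only logarithmically with depth, so the local term $kL$ still dominates for deep models unless $\pi$ is large relative to $k$.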

Abstract

Transformers have revolutionized natural language processing, but their quadratic complexity with respect to sequence length remains a fundamental bottleneck for long-range modeling. While sparse attention mechanisms like RingAttention reduce computational costs by restricting attention to local neighborhoods, they suffer from limited receptive fields and lack of adaptability. We present $\pi$-Attention, a periodic sparse Transformer that factorizes attention into ring-local neighborhoods, deterministic $\pi$-stride skips, and an adaptive fusion gate. The periodic structure provides predictable coverage of distant tokens, while the sparse footprint keeps the per-layer complexity linear in context length. We prove that $\pi$-Attention achieves $\mathcal{O}(kL + \pi\log L)$ receptive field growth compared to $\mathcal{O}(kL)$ for RingAttention, where $k$ is the local window size, $\pi$ is the skip period, and $L$ is the number of layers. Extensive experiments on language modeling, retrieval, and vision-language tasks demonstrate that $\pi$-Attention matches or surpasses dense attention quality with 8.3\% lower perplexity than RingAttention while using 50\% fewer GPUs for the same context length. Our detailed ablations and visualizations reveal the importance of periodic skips, adaptive fusion, and head-level sparsity coordination for efficient long-context modeling.
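The three-part factorization described in the abstract can be illustrated with a toy, pure-Python sketch. Several details here are assumptions rather than the paper's definitions: the pattern is causal with no wrap-around, the skip path links token $i$ to the single token $i - \pi$, queries, keys, and values are scalars, and the fusion gate is a sigmoid over a supplied logit. All function names are hypothetical.

```python
import math


def pi_neighbors(n, k, pi):
    """Causal neighbor lists for n tokens.

    ASSUMPTION: local path = token i plus its k predecessors;
    skip path = the single token i - pi (when it exists).
    """
    local, skip = [], []
    for i in range(n):
        local.append(list(range(max(0, i - k), i + 1)))
        skip.append([i - pi] if i - pi >= 0 else [])
    return local, skip


def softmax_average(scores, values):
    """Softmax-weighted average of scalar values (numerically stable)."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return sum(w * v for w, v in zip(weights, values)) / total


def fused_output(i, q, keys, values, local, skip, gate_logit):
    """Blend local and skip contexts for token i.

    ASSUMPTION: gate g = sigmoid(gate_logit); output = g*local + (1-g)*skip.
    Falls back to the local context when token i has no skip neighbor.
    """
    loc = softmax_average([q * keys[j] for j in local[i]],
                          [values[j] for j in local[i]])
    if not skip[i]:
        return loc
    sk = softmax_average([q * keys[j] for j in skip[i]],
                         [values[j] for j in skip[i]])
    g = 1.0 / (1.0 + math.exp(-gate_logit))
    return g * loc + (1.0 - g) * sk


def edge_count(local, skip):
    """Number of (query, key) pairs scored; at most n*(k+2) in this sketch."""
    return sum(len(a) + len(b) for a, b in zip(local, skip))
```

In this sketch each token scores at most $k + 2$ keys, so the total work grows as $O(nk)$ rather than $O(n^2)$, which is the property the abstract attributes to the sparse footprint.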

Paper Structure

This paper contains 26 sections, 3 theorems, 8 equations, 4 figures, 6 tables.

Key Result

Proposition 1

Consider causal $\pi$-Attention with $L$ layers, local radius $k$, skip period $\pi$, and self-attention restricted to neighbors $j \le i$ at every layer. Let $R(L)$ be the maximum number of tokens to the left of $i$ whose information can reach $i$ after $L$ layers. Under the propagation rule that e

Figures (4)

  • Figure 1: Conceptual illustration of $\pi$-Attention showing ring-local neighborhoods, periodic $\pi$-skip connections, and an adaptive fusion gate that balances local and skip contexts per token, enabling efficient long-range modeling with linear complexity.
  • Figure 2: Token interaction flow in $\pi$-Attention across different layers. Local ring attention dominates early layers while skip connections activate deeper layers, enabling efficient information propagation across the sequence.
  • Figure 3: Core $\pi$-Attention module highlighting the fusion of ring-local and periodic skip paths.
  • Figure 4: Transformer block using $\pi$-Attention.

Theorems & Definitions (6)

  • Proposition 1: Causal receptive-field upper bound
  • proof: Proof sketch
  • Theorem 1: Computational Complexity
  • proof
  • Theorem 2: Memory Complexity
  • proof