$π$-Attention: Periodic Sparse Transformers for Efficient Long-Context Modeling

Dong Liu; Yanxuan Yu

TL;DR

PiAttention addresses the $O(n^2)$ cost of standard self-attention by combining ring-local attention with deterministic $π$-stride skips and a dynamic fusion gate that blends local and long-range context. The method achieves a receptive-field bound of $R(L) \le kL + π \lceil \log_2 L \rceil$ and maintains per-layer complexity $O(nk)$ with memory $O(nk + n)$, improving over RingAttention. Empirically, it matches or surpasses dense attention across language, retrieval, and vision-language tasks, delivering an $8.3\%$ perplexity improvement on WikiText-103 and $24.1\%$ fewer FLOPs while using fewer GPUs. Ablations show that both periodic skips and adaptive fusion are necessary, with favorable throughput–quality trade-offs across context lengths up to 32k tokens. These results demonstrate a practical, scalable pathway for long-context modeling with transformers.
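To make the receptive-field bound concrete, the short sketch below evaluates $kL + π \lceil \log_2 L \rceil$ next to the $kL$ growth attributed to RingAttention. The function names and the sample values of $k$, $π$, and $L$ are illustrative assumptions, not taken from the paper.

```python
def receptive_field_bound(k: int, pi: int, L: int) -> int:
    """Evaluate the quoted bound k*L + pi*ceil(log2(L)).

    Hypothetical helper: k is the local window size, pi the skip period,
    and L the quantity the bound is indexed by (must be >= 1).
    """
    if L < 1:
        raise ValueError("L must be a positive integer")
    # (L - 1).bit_length() equals ceil(log2(L)) for L >= 1, using integers only.
    ceil_log2 = (L - 1).bit_length()
    return k * L + pi * ceil_log2


def ring_only_bound(k: int, L: int) -> int:
    """The k*L growth attributed to ring-local attention alone."""
    return k * L


if __name__ == "__main__":
    # Sample values chosen only for illustration.
    k, pi = 4, 16
    for L in (1, 2, 8, 32):
        print(L, ring_only_bound(k, L), receptive_field_bound(k, pi, L))
```

The only difference between the two quantities is the additive $π \lceil \log_2 L \rceil$ term, so the gap between them grows logarithmically in $L$.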

Abstract

Transformers have revolutionized natural language processing, but their quadratic complexity with respect to sequence length remains a fundamental bottleneck for long-range modeling. While sparse attention mechanisms like RingAttention reduce computational costs by restricting attention to local neighborhoods, they suffer from limited receptive fields and lack adaptability. We present PiAttention, a periodic sparse Transformer that factorizes attention into ring-local neighborhoods, deterministic $π$-stride skips, and an adaptive fusion gate. The periodic structure provides predictable coverage of distant tokens, while the sparse footprint keeps the per-layer complexity linear in context length. We prove that PiAttention achieves $\mathcal{O}(kL + π\log L)$ receptive field growth compared to $\mathcal{O}(kL)$ for RingAttention, where $k$ is the local window size, $π$ is the skip period, and $L$ is the sequence length. Extensive experiments on language modeling, retrieval, and vision-language tasks demonstrate that PiAttention matches or surpasses dense attention quality, with 8.3% lower perplexity than RingAttention while using 50% fewer GPUs for the same context length. Our detailed ablations and visualizations reveal the importance of periodic skips, adaptive fusion, and head-level sparsity coordination for efficient long-context modeling.
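The three components named above can be sketched in NumPy under stated assumptions. This is not the authors' implementation: the abstract does not define the skip pattern or the gate, so the sketch assumes each token skips to the two positions $±π$ away on the ring and that the gate is a per-token sigmoid of a learned projection. All function and parameter names (`ring_local_mask`, `pi_stride_mask`, `w_gate`) are hypothetical, and the masks are applied to dense score matrices for readability, so the code shows the sparsity pattern without realizing the $O(nk)$ cost.

```python
import numpy as np


def ring_local_mask(n: int, k: int) -> np.ndarray:
    """True where two positions are within k steps of each other on a ring."""
    idx = np.arange(n)
    d = np.abs(idx[:, None] - idx[None, :])
    return np.minimum(d, n - d) <= k


def pi_stride_mask(n: int, pi: int) -> np.ndarray:
    """Assumed skip pattern: each token sees the tokens +pi and -pi away (mod n)."""
    idx = np.arange(n)
    d = (idx[None, :] - idx[:, None]) % n
    return (d == pi % n) | (d == (-pi) % n)


def masked_attention(q, k_, v, mask):
    """Softmax attention restricted to the True entries of mask."""
    scores = q @ k_.T / np.sqrt(q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v


def fused_attention(x, q, k_, v, w_gate, window, pi):
    """Blend local and skip outputs with an assumed per-token sigmoid gate."""
    n = x.shape[0]
    local = masked_attention(q, k_, v, ring_local_mask(n, window))
    skip = masked_attention(q, k_, v, pi_stride_mask(n, pi))
    g = 1.0 / (1.0 + np.exp(-(x @ w_gate)))  # shape (n, 1), values in (0, 1)
    return g * local + (1.0 - g) * skip


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 16, 8
    x = rng.standard_normal((n, d))
    q, k_, v = (x @ rng.standard_normal((d, d)) for _ in range(3))
    out = fused_attention(x, q, k_, v, rng.standard_normal((d, 1)), window=2, pi=5)
    print(out.shape)
```

Under these assumptions each query attends to $2k + 1$ local positions plus two skip positions, which is the kind of per-row footprint that keeps the total cost linear in the sequence length.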
