RRAttention: Dynamic Block Sparse Attention via Per-Head Round-Robin Shifts for Long-Context Inference
Siran Liu, Guoxia Wang, Sa Wang, Jinle Zeng, HaoYang Xie, Siyu Lou, JiaBin Yang, DianHai Yu, Haifeng Wang, Chao Yang
TL;DR
RRAttention tackles the $O(L^2)$ attention bottleneck in long-context LLMs with a preprocessing-free dynamic sparse attention mechanism that preserves query independence and discovers global patterns through stride-level aggregation. Its core technique, head round-robin sampling, rotates the sampled query positions across attention heads within each stride. Combined with Top-$\tau$ block selection and static protection of the last query block, this reduces complexity from $O(L^2)$ to $O(L^2/S^2)$. On HELMET (language) and Video-MME (multimodal), RRAttention recovers over 99% of full-attention performance while computing roughly half of the attention blocks, delivers up to $2.4\times$ speedup at 128K context, and outperforms existing dynamic sparse attention methods. The authors position the method as a practical approach to long-context inference and as a basis for future extensions to decoding-stage sparsification and training-time sparsity learning.
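To make the rotation idea concrete, the following is a minimal sketch of per-head round-robin query sampling, not the authors' implementation. It assumes that head $h$ samples the query at offset $h \bmod S$ inside every stride of length $S$; the function names `round_robin_positions` and `score_entries` and the toy sizes are hypothetical. The second function counts score entries under one reading consistent with the stated $O(L^2/S^2)$ cost: $L/S$ sampled queries scored against $L/S$ stride-level key groups.

```python
import numpy as np

def round_robin_positions(seq_len: int, stride: int, num_heads: int) -> np.ndarray:
    """Sampled query indices per head, shape (num_heads, seq_len // stride).

    Assumption (not confirmed by the source): head h uses offset h % stride
    inside every stride. A ragged tail (seq_len % stride) is ignored here.
    """
    num_strides = seq_len // stride
    starts = np.arange(num_strides) * stride      # first index of each stride
    offsets = np.arange(num_heads) % stride       # per-head rotation
    return starts[None, :] + offsets[:, None]

def score_entries(seq_len: int, stride: int) -> int:
    """Score entries per head if L/S sampled queries meet L/S key groups."""
    n = seq_len // stride
    return n * n

# Toy example: 16 positions, stride 4, 4 heads.
pos = round_robin_positions(seq_len=16, stride=4, num_heads=4)
print(pos)
print(score_entries(16, 4), "entries instead of", 16 * 16)
```

In this toy setting each head looks at only one query per stride, yet the heads together cover every query position, which is the intuition behind rotating the offset rather than fixing it.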
Abstract
The quadratic complexity of attention mechanisms poses a critical bottleneck for large language models processing long contexts. While dynamic sparse attention methods offer input-adaptive efficiency, they face fundamental trade-offs: requiring preprocessing, lacking global evaluation, violating query independence, or incurring high computational overhead. We present RRAttention, a novel dynamic sparse attention method that simultaneously achieves all desirable properties through a head round-robin (RR) sampling strategy. By rotating query sampling positions across attention heads within each stride, RRAttention maintains query independence while enabling efficient global pattern discovery with stride-level aggregation. Our method reduces complexity from $O(L^2)$ to $O(L^2/S^2)$ and employs adaptive Top-$\tau$ selection for optimal sparsity. Extensive experiments on natural language understanding (HELMET) and multimodal video comprehension (Video-MME) demonstrate that RRAttention recovers over 99% of full attention performance while computing only half of the attention blocks, achieving $2.4\times$ speedup at 128K context length and outperforming existing dynamic sparse attention methods.
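The adaptive Top-$\tau$ selection mentioned above can be illustrated with a small sketch. This is one plausible reading, offered under stated assumptions rather than as the paper's procedure: for each query block, estimated block scores are normalized, and key blocks are kept in descending order of score until their cumulative mass reaches $\tau$. The last query block is always kept in full, mirroring the static protection described in the TL;DR. Causal masking is omitted for brevity, and the function name `select_blocks` and the example scores are hypothetical.

```python
import numpy as np

def select_blocks(block_scores: np.ndarray, tau: float) -> np.ndarray:
    """Boolean keep-mask over (num_q_blocks, num_k_blocks).

    Assumptions (illustrative only): scores are positive importance
    estimates, and a block is kept while the mass of higher-ranked
    blocks in its row is still below tau.
    """
    probs = block_scores / block_scores.sum(axis=1, keepdims=True)
    order = np.argsort(-probs, axis=1)                     # descending per row
    sorted_probs = np.take_along_axis(probs, order, axis=1)
    mass_before = np.cumsum(sorted_probs, axis=1) - sorted_probs
    keep_sorted = mass_before < tau                        # stop once tau is covered
    mask = np.zeros_like(keep_sorted)                      # boolean, all False
    np.put_along_axis(mask, order, keep_sorted, axis=1)    # undo the sort
    mask[-1, :] = True                                     # protect last query block
    return mask

# Hypothetical scores for 3 query blocks x 3 key blocks.
scores = np.array([[8.0, 1.5, 0.5],
                   [1.0, 2.0, 5.0],
                   [5.0, 4.0, 1.0]])
mask = select_blocks(scores, tau=0.7)
print(mask)
```

Because the threshold is on cumulative mass rather than on a fixed block count, rows with concentrated scores keep fewer blocks than rows with diffuse scores, which is what makes the sparsity input-adaptive.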
