RRAttention: Dynamic Block Sparse Attention via Per-Head Round-Robin Shifts for Long-Context Inference

Siran Liu; Guoxia Wang; Sa Wang; Jinle Zeng; HaoYang Xie; Siyu Lou; JiaBin Yang; DianHai Yu; Haifeng Wang; Chao Yang

TL;DR

RRAttention tackles the $O(L^2)$ attention bottleneck in long-context LLMs by introducing a preprocessing-free dynamic sparse attention mechanism that preserves query independence and enables global pattern discovery through stride-level aggregation. The core approach, head round-robin sampling, rotates sampled query positions across attention heads within each stride, coupled with Top-$\tau$ block selection and a static protection of the last query block, achieving $O(L^2/S^2)$ complexity. Empirical results on HELMET (language) and Video-MME (multimodal) show RRAttention recovers over 99% of full-attention performance while visiting roughly half of the blocks, delivering up to $2.4\times$ speedups at 128K context and outperforming existing dynamic sparse methods. The work provides a practical, scalable solution for long-context inference with broad applicability and a solid foundation for future extensions in decoding-stage sparsification and training-time sparsity learning.
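For a rough sense of scale, the quoted complexity terms can be worked through with the 128K context from the results and the stride $S=4$ from the Figure 2 example. Pairing these two values is an assumption made here for illustration, not a configuration reported in the paper.

```latex
\[
L = 2^{17} = 131{,}072, \qquad
L^2 = 2^{34} \approx 1.7 \times 10^{10}, \qquad
\frac{L^2}{S^2} = \frac{2^{34}}{2^{4}} = 2^{30} \approx 1.1 \times 10^{9}
\]
```

Under these assumed values the $O(L^2/S^2)$ term is $S^2 = 16$ times smaller than the $O(L^2)$ term. This compares only the two complexity expressions; the measured end-to-end gain reported above is up to $2.4\times$.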

Abstract

The quadratic complexity of attention mechanisms poses a critical bottleneck for large language models processing long contexts. While dynamic sparse attention methods offer input-adaptive efficiency, they face fundamental trade-offs: requiring preprocessing, lacking global evaluation, violating query independence, or incurring high computational overhead. We present RRAttention, a novel dynamic sparse attention method that simultaneously achieves all desirable properties through a head round-robin (RR) sampling strategy. By rotating query sampling positions across attention heads within each stride, RRAttention maintains query independence while enabling efficient global pattern discovery with stride-level aggregation. Our method reduces complexity from $O(L^2)$ to $O(L^2/S^2)$ and employs adaptive Top-$\tau$ selection for optimal sparsity. Extensive experiments on natural language understanding (HELMET) and multimodal video comprehension (Video-MME) demonstrate that RRAttention recovers over 99% of full attention performance while computing only half of the attention blocks, achieving $2.4\times$ speedup at 128K context length and outperforming existing dynamic sparse attention methods.
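The two ideas named in the abstract, rotating the sampled query position across heads and keeping blocks by a Top-$\tau$ rule, can be sketched as below. This is a minimal illustration, not the authors' implementation: the offset rule `head % stride`, the function names, and the reading of Top-$\tau$ as "smallest set of blocks whose normalized scores reach $\tau$" are all assumptions made here.

```python
import numpy as np


def rr_sample_positions(seq_len: int, stride: int, head: int) -> np.ndarray:
    """Query indices sampled by one head: one position per stride.

    Assumption: the in-stride offset rotates with the head index
    (offset = head % stride), so consecutive heads sample different positions.
    """
    offset = head % stride
    return np.arange(offset, seq_len, stride)


def top_tau_blocks(block_scores: np.ndarray, tau: float) -> np.ndarray:
    """Indices of the highest-scoring blocks whose normalized mass reaches tau.

    Assumption: Top-tau is read as a cumulative-mass threshold over
    non-negative block importance scores.
    """
    probs = block_scores / block_scores.sum()
    order = np.argsort(-probs, kind="stable")    # highest score first
    cum = np.cumsum(probs[order])
    k = int(np.searchsorted(cum, tau)) + 1       # first prefix with mass >= tau
    return np.sort(order[: min(k, len(order))])


if __name__ == "__main__":
    # With stride 4, heads 0..3 together touch every query position once.
    for h in range(4):
        print(h, rr_sample_positions(16, 4, h).tolist())
    print(top_tau_blocks(np.array([4.0, 2.0, 1.0, 1.0]), 0.75).tolist())
```

In this sketch each head scores only `seq_len / stride` queries, yet a group of `stride` heads jointly covers every query position, which is one way to read the abstract's claim of global pattern discovery without extra per-head cost.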

Paper Structure (24 sections, 13 equations, 7 figures, 11 tables)

Introduction
Preliminary and Background
Block-wise Sparse Attention
Multi-Dimensional Analysis of Attention Selection
Methodology: RRAttention
Query Sampling with Head Round-Robin Strategy
Stride-level Importance Estimation
Block-level Selection via Top-$\tau$ Thresholding
Experiments
Settings
Main Result
Ablation Study
Related Work
Training-based Sparse Methods
Inference-oriented Sparse Methods
...and 9 more sections

Figures (7)

Figure 1: Sparsity-accuracy trade-offs across different models and context lengths on HELMET.
Figure 2: Illustration of RRAttention. ①, ②, and ③ represent the three stages of our method. The example shows a configuration with stride size $S=4$ and block size $B=8$.
Figure 3: Runtime comparison of attention methods on LLaMA-3.1-8B-Instruct across different context lengths. (a): Attention overhead. (b): Pattern search time.
Figure 4: Attention pattern visualization at 16K context length. Each pair shows FullAttention ground truth (left) and RRAttention selection (right).
Figure 5: Attention pattern visualization at 32K context length. Each pair shows FullAttention ground truth (left) and RRAttention selection (right).
...and 2 more figures
