RRAttention: Dynamic Block Sparse Attention via Per-Head Round-Robin Shifts for Long-Context Inference

Siran Liu, Guoxia Wang, Sa Wang, Jinle Zeng, HaoYang Xie, Siyu Lou, JiaBin Yang, DianHai Yu, Haifeng Wang, Chao Yang

TL;DR

RRAttention tackles the $O(L^2)$ attention bottleneck in long-context LLMs by introducing a preprocessing-free dynamic sparse attention mechanism that preserves query independence and enables global pattern discovery through stride-level aggregation. The core approach, head round-robin sampling, rotates sampled query positions across attention heads within each stride, coupled with Top-$\tau$ block selection and a static protection of the last query block, achieving $O(L^2/S^2)$ complexity. Empirical results on HELMET (language) and Video-MME (multimodal) show RRAttention recovers over 99% of full-attention performance while visiting roughly half of the blocks, delivering up to $2.4\times$ speedups at 128K context and outperforming existing dynamic sparse methods. The work provides a practical, scalable solution for long-context inference with broad applicability and a solid foundation for future extensions in decoding-stage sparsification and training-time sparsity learning.

Abstract

The quadratic complexity of attention mechanisms poses a critical bottleneck for large language models processing long contexts. While dynamic sparse attention methods offer input-adaptive efficiency, they face fundamental trade-offs: requiring preprocessing, lacking global evaluation, violating query independence, or incurring high computational overhead. We present RRAttention, a novel dynamic sparse attention method that simultaneously achieves all desirable properties through a head round-robin (RR) sampling strategy. By rotating query sampling positions across attention heads within each stride, RRAttention maintains query independence while enabling efficient global pattern discovery with stride-level aggregation. Our method reduces complexity from $O(L^2)$ to $O(L^2/S^2)$ and employs adaptive Top-$\tau$ selection for optimal sparsity. Extensive experiments on natural language understanding (HELMET) and multimodal video comprehension (Video-MME) demonstrate that RRAttention recovers over 99% of full attention performance while computing only half of the attention blocks, achieving $2.4\times$ speedup at 128K context length and outperforming existing dynamic sparse attention methods.
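
The two ideas named in the abstract, per-head rotation of sampled query positions and Top-$\tau$ block selection, can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The function names are hypothetical, the rotation rule (offset equal to the head index modulo $S$) is an assumption, and so is the reading of Top-$\tau$ as "keep the highest-scoring blocks until their normalized score mass reaches $\tau$". Stride-level aggregation and the protection of the last query block are not modeled here.

```python
def round_robin_positions(seq_len, stride, head):
    """Query positions sampled by one head: one position per stride.

    Assumption (not from the paper): the offset inside each stride is
    head % stride, so consecutive heads look at consecutive positions.
    """
    offset = head % stride
    return [start + offset
            for start in range(0, seq_len, stride)
            if start + offset < seq_len]


def top_tau_blocks(block_scores, tau):
    """Indices of the highest-scoring blocks whose share of the total
    score first reaches tau (assumed interpretation of Top-tau)."""
    total = sum(block_scores)
    if total <= 0:
        return []
    # Visit blocks from highest to lowest score.
    order = sorted(range(len(block_scores)),
                   key=lambda i: block_scores[i],
                   reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += block_scores[i] / total
        if mass >= tau:
            break
    return sorted(kept)


if __name__ == "__main__":
    # With stride 4, heads 0..3 together cover every query position.
    for h in range(4):
        print(h, round_robin_positions(8, 4, h))
    # Blocks 0 and 2 carry 80% of the score mass, enough for tau = 0.75.
    print(top_tau_blocks([5, 1, 3, 1], 0.75))
```

Under these assumptions each head scores only one query per stride, yet the heads jointly cover all positions in a stride. If keys are likewise aggregated at stride level, the number of scored pairs per head falls by a factor of $S^2$ (for example, 16 when $S=4$), which matches the stated $O(L^2/S^2)$ cost.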

Paper Structure (24 sections, 13 equations, 7 figures, 11 tables)

Figures (7)

  • Figure 1: Sparsity-accuracy trade-offs across different models and context lengths on HELMET.
  • Figure 2: Illustration of RRAttention. ①, ②, and ③ represent the three stages of our method. The example shows a configuration with stride size $S=4$ and block size $B=8$.
  • Figure 3: Runtime comparison of attention methods on LLaMA-3.1-8B-Instruct across different context lengths. (a): Attention overhead. (b): Pattern search time.
  • Figure 4: Attention pattern visualization at 16K context length. Each pair shows FullAttention ground truth (left) and RRAttention selection (right).
  • Figure 5: Attention pattern visualization at 32K context length. Each pair shows FullAttention ground truth (left) and RRAttention selection (right).
  • ...and 2 more figures