AnchorAttention: Difference-Aware Sparse Attention with Stripe Granularity

Yu Zhang, Dong Guo, Fang Wu, Guoliang Zhu, Dian Ding, Yiming Zhang

TL;DR

AnchorAttention tackles the prefill bottleneck in long-context LLMs by introducing a difference-aware, stripe-granularity sparse attention mechanism. It combines pattern-based anchor computation, difference-aware stripe sparsity identification, and fine-grained discrete KV loading, implemented at the kernel level to maximize parallelism without extra memory. The method achieves higher recall and substantially faster prefill attention, reporting a 1.44x speedup at 128k context versus the previous state-of-the-art and up to 4.6x versus full KV attention, while preserving accuracy across diverse long-context benchmarks. This approach enables efficient, globally informed attention with finer sparsity than block-based methods, offering practical impact for scalable inference in ultra-long contexts.

Abstract

Large Language Models (LLMs) with extended context lengths face significant computational challenges during the pre-filling phase, primarily due to the quadratic complexity of self-attention. Existing methods typically employ dynamic pattern matching and block-sparse low-level implementations. However, their reliance on local information for pattern identification fails to capture global contexts, and the coarse granularity of blocks leads to persistent internal sparsity, resulting in suboptimal accuracy and efficiency. To address these limitations, we propose AnchorAttention, a difference-aware, dynamic sparse attention mechanism that efficiently identifies critical attention regions at a finer stripe granularity while adapting to global contextual information, achieving superior speed and accuracy. AnchorAttention comprises three key components: (1) Pattern-based Anchor Computation, leveraging the commonalities present across all inputs to rapidly compute a set of near-maximum scores as the anchor; (2) Difference-aware Stripe Sparsity Identification, performing difference-aware comparisons with the anchor to quickly obtain discrete coordinates of significant regions in a stripe-like sparsity pattern; (3) Fine-grained Sparse Computation, replacing the traditional contiguous KV block loading approach with simultaneous discrete KV position loading to maximize sparsity rates while preserving full hardware computational potential. With its finer-grained sparsity strategy, AnchorAttention achieves higher sparsity rates at the same recall level, significantly reducing computation time. Compared to previous state-of-the-art methods, at a text length of 128k, it achieves a speedup of 1.44x while maintaining higher recall rates.
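The three components can be illustrated with a toy, single-head, causal sketch in plain Python. This is not the authors' kernel: it assumes the "commonalities" are the initial tokens plus a local window, and the names `n_init`, `n_local`, and `theta` are hypothetical parameters introduced here. It also scores every key to make the selection, so it shows the selection rule only, not the speedup.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def anchor_sparse_attention(Q, K, V, n_init=1, n_local=2, theta=2.0):
    """Toy per-query sketch of anchor-based sparse attention (causal)."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    outputs, kept = [], []
    for i, q in enumerate(Q):
        # Scores against all causal keys (a real kernel would avoid this).
        scores = [dot(q, K[j]) * scale for j in range(i + 1)]

        # Step 1 (assumed pattern): anchor = max score over the
        # initial tokens and the local window.
        pattern = set(range(min(n_init, i + 1)))
        pattern |= set(range(max(0, i - n_local + 1), i + 1))
        anchor = max(scores[j] for j in pattern)

        # Step 2: keep pattern keys, plus any key whose score is
        # within theta of the anchor (difference-aware selection).
        selected = [j for j in range(i + 1)
                    if j in pattern or anchor - scores[j] <= theta]

        # Step 3: gather only the selected (discrete) K/V positions
        # and run a numerically stable softmax over them.
        m = max(scores[j] for j in selected)
        w = [math.exp(scores[j] - m) for j in selected]
        z = sum(w)
        out = [sum(wt * V[j][c] for wt, j in zip(w, selected)) / z
               for c in range(len(V[0]))]
        outputs.append(out)
        kept.append(selected)
    return outputs, kept

# Tiny example: 4 tokens, head dimension 2.
Q = K = [[1, 0], [0, 1], [1, 1], [0.5, -1]]
V = [[1, 0], [0, 1], [2, 2], [3, -1]]
outputs, kept = anchor_sparse_attention(Q, K, V, n_init=1, n_local=1, theta=0.0)
```

A larger `theta` keeps more key positions and approaches full attention; a smaller one raises sparsity at the cost of recall.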

Paper Structure

This paper contains 25 sections, 7 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: (a) Block-sparse pattern, with yellow regions indicating computed blocks; (b) Stripe-sparse pattern, with red regions showing computed areas, enabling higher sparsity by loading non-contiguous positions across multiple blocks.
  • Figure 2: Acceleration of attention computation compared to FlashAttention.
  • Figure 3: (a) Heatmaps vary significantly across different inputs. (b) Stripe sparsity appears in specific attention maps, demonstrating that local information fails to capture the full attention distribution.
  • Figure 4: Recall heatmaps of sparsity strategies using LLaMA-3.1-8B on the 128k RULER dataset (Hsieh et al., 2024), with average sparsity rates of 93.7% (a), 96.4% (b), and 94.1% (c). Per-head sparsity rates are detailed in the appendix. Recall is defined as the percentage of attention values that are numerically equal between the current sparse attention and the full attention (Jiang et al., 2024).
  • Figure 5: The distribution of maximum attention scores highlights the dominance of anchor positions.
  • ...and 5 more figures
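The recall metric in the Figure 4 caption can be read in more than one way. The sketch below assumes one common reading: the fraction of full-attention probability mass that falls on the key positions a sparse strategy keeps. The function name `attention_mass_recall` and its inputs are hypothetical, and the paper's exact computation may differ.

```python
def attention_mass_recall(full_probs, kept):
    """Fraction of one query's full-attention probability mass
    covered by the kept key indices (assumed reading of recall)."""
    covered = sum(full_probs[j] for j in sorted(set(kept)))
    return covered / sum(full_probs)

# One query row of full-attention probabilities; keys 0 and 3 are kept.
row = [0.5, 0.25, 0.125, 0.125]
recall = attention_mass_recall(row, [0, 3])
```

Multiplying the result by 100 gives a percentage comparable in form to the values shown in a recall heatmap.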