SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention

Qianchao Zhu, Jiangfei Duan, Chang Chen, Siran Liu, Guanyu Feng, Xin Lv, Xiao Chuanfu, Dahua Lin, Chao Yang

TL;DR

This work tackles the bottleneck of quadratic attention in long-context LLM inference by introducing SampleAttention, a runtime-adaptive structured sparse attention mechanism. It leverages a two-stage pipeline—query-guided chunked sampling and score-based key-value filtering—guided by the Cumulative Residual Attention ($CRA$) metric to select a minimal, yet effective, set of key-value blocks across head-specific patterns. The approach yields near-lossless accuracy while significantly reducing Time-to-First-Token (TTFT), achieving up to $5.29\times$ TTFT reduction in extreme long-context scenarios and establishing a new Pareto frontier over prior sparse attention methods. Hardware-aware kernel optimizations further enhance practicality, and the method remains compatible with existing KV-cache eviction techniques, enabling scalable deployment in real-world long-context LLM tasks.

Abstract

Large language models (LLMs) now support extremely long context windows, but the quadratic complexity of vanilla attention results in significantly long Time-to-First-Token (TTFT) latency. Existing approaches to address this complexity require additional pretraining or finetuning, and often sacrifice model accuracy. In this paper, we first provide both theoretical and empirical foundations for near-lossless sparse attention. We find that dynamically capturing head-specific sparse patterns at runtime with low overhead is crucial. To address this, we propose SampleAttention, an adaptive, structured, and near-lossless sparse attention mechanism. Leveraging the significant sparse patterns we observe, SampleAttention attends to a fixed percentage of adjacent tokens to capture local window patterns, and employs a two-stage query-guided key-value filtering approach, which adaptively selects a minimum set of key-values with low overhead, to capture column stripe patterns. Comprehensive evaluations show that SampleAttention can seamlessly replace vanilla attention in off-the-shelf LLMs with nearly no accuracy loss, and reduces TTFT by up to $2.42\times$ compared with FlashAttention.
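The selection idea described in the abstract can be illustrated with a small, hedged sketch. It is not the authors' implementation: the function names `select_columns` and `residual_attention`, the use of a precomputed attention-probability matrix, and the reading of the CRA-style score as "the smallest retained probability mass over all query rows" are assumptions made for illustration only. A real kernel would compute scores only for the sampled queries and would work on blocks rather than single columns.

```python
def select_columns(probs, sampled_rows, alpha):
    """Pick the fewest key columns whose accumulated score reaches a share alpha.

    probs        : list of rows of causal attention probabilities (assumed precomputed)
    sampled_rows : indices of the query rows used as a cheap sample (stage 1)
    alpha        : target share of the sampled attention mass, between 0 and 1
    """
    n_keys = len(probs[0])
    # Stage 1: accumulate per-column scores over the sampled query rows only.
    col_score = [sum(probs[r][j] for r in sampled_rows) for j in range(n_keys)]
    total = sum(col_score)
    # Stage 2: keep top-scoring columns until the kept share reaches alpha.
    order = sorted(range(n_keys), key=lambda j: col_score[j], reverse=True)
    kept, mass = [], 0.0
    for j in order:
        kept.append(j)
        mass += col_score[j]
        if mass >= alpha * total:
            break
    return sorted(kept)


def residual_attention(probs, kept, window):
    """Smallest retained probability mass over all query rows (assumed CRA-style score).

    A key j is retained for query i if it is a kept column or lies inside
    the local window of the `window` most recent tokens.
    """
    kept_set = set(kept)
    worst = 1.0
    for i, row in enumerate(probs):
        mass = sum(
            p for j, p in enumerate(row)
            if j <= i and (j in kept_set or i - j < window)
        )
        worst = min(worst, mass)
    return worst


# Toy 4x4 causal attention matrix with a strong "column stripe" on key 0.
probs = [
    [1.0, 0.0, 0.0, 0.0],
    [0.9, 0.1, 0.0, 0.0],
    [0.8, 0.05, 0.15, 0.0],
    [0.7, 0.05, 0.05, 0.2],
]
kept = select_columns(probs, sampled_rows=[2, 3], alpha=0.7)
score = residual_attention(probs, kept, window=1)
print(kept, round(score, 3))
```

In this toy case the single dominant column plus a one-token local window already retains most of each row's probability mass, which is the kind of structure the abstract refers to as column stripe and local window patterns. Raising `alpha` keeps more columns and raises the retained mass at the cost of more computation.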

Paper Structure (26 sections, 3 equations, 13 figures, 6 tables, 1 algorithm)

Figures (13)

  • Figure 1: Compared to previous static and dynamic sparse attention methods, SampleAttention captures adaptive structured sparse patterns for each head. It achieves a significant reduction in TTFT compared to FlashAttention2.
  • Figure 2: The average sparsity ratio of three different models with long context windows on tasks with varying length ranges.
  • Figure 3: The sparsity ratio of ChatGLM3 (28 layers $\times$ 32 heads) and InternLM2 (32 layers $\times$ 32 heads), evaluated over different tasks during prefill. The sparsity ratio varies across different attention heads, input contents, and model architectures.
  • Figure 4: The visualization of attention reveals diverse structured sparse patterns. Different heads with the same prompt exhibit dynamic sparse indices and ratios, but the patterns can generally be categorized into (a) column, (b) slash, or (c) a composed pattern with column and slash. These structured patterns generally extend across the entire head. However, in some heads, such as (c), a prominent column structure in the upper half gradually fades in the lower half. Additionally, as shown in (d), the same head exhibits significant pattern differences under different prompts, highlighting the content-aware nature of the sparsity.
  • Figure 5: The curves of model accuracy and attention recall with changing CRA thresholds $\alpha$ for different tasks.
  • ...and 8 more figures