Tactic: Adaptive Sparse Attention with Clustering and Distribution Fitting for Long-Context LLMs

Kan Zhu, Tian Tang, Qinyu Xu, Yile Gu, Zhichen Zeng, Rohan Kadekodi, Liangyu Zhao, Ang Li, Arvind Krishnamurthy, Baris Kasikci

TL;DR

Tactic tackles the decode-time bottleneck in long-context LLMs by replacing fixed token budgets with a sparsity-adaptive mechanism that targets a cumulative attention score $P$. It combines clustering-based sorting of key vectors with distribution fitting of attention scores to efficiently identify a minimal token subset that achieves the desired attention coverage, while supporting Grouped Query Attention and leveraging FlashInfer for fast execution. The approach yields strong accuracy, with tighter attention-distance bounds, and substantial speedups across multiple models and long-context benchmarks: up to $7.29\times$ decode attention speedup and $1.58\times$ end-to-end inference speedup. By adapting token budgets to the actual sparsity patterns of attention, this work provides a practical, calibration-free path to scalable long-context inference in accuracy-sensitive applications. The combination of intrinsic sparsity analysis, adaptive token budgeting, and low-overhead clustering-based sorting makes Tactic broadly applicable to real-world long-context inference scenarios.

Abstract

Long-context models are essential for many applications but face inefficiencies in loading large KV caches during decoding. Prior methods enforce fixed token budgets for sparse attention, assuming a set number of tokens can approximate full attention. However, these methods overlook variations in the importance of attention across heads, layers, and contexts. To address these limitations, we propose Tactic, a sparsity-adaptive and calibration-free sparse attention mechanism that dynamically selects tokens based on their cumulative attention scores rather than a fixed token budget. By setting a target fraction of total attention scores, Tactic ensures that token selection naturally adapts to variations in attention sparsity. To efficiently approximate this selection, Tactic leverages clustering-based sorting and distribution fitting, allowing it to accurately estimate token importance with minimal computational overhead. We show that Tactic outperforms existing sparse attention algorithms, achieving superior accuracy and up to 7.29x decode attention speedup. This improvement translates to an overall 1.58x end-to-end inference speedup, making Tactic a practical and effective solution for long-context LLM inference in accuracy-sensitive applications.
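
The selection target described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: it computes the exact attention weights for one query and picks the smallest token set whose weights sum to at least the target fraction, which is the oracle quantity that Tactic approximates with clustering-based sorting and distribution fitting. The function name `select_tokens_by_cumulative_score`, the tensor shapes, and the synthetic keys and queries are all assumptions made for illustration.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def select_tokens_by_cumulative_score(q, K, p):
    """Oracle selection (hypothetical helper, not the paper's code).

    Returns the indices of the smallest set of tokens whose exact
    attention weights sum to at least `p`, plus the full weight vector.
    q: (d,) query vector, K: (n, d) key matrix, p: target fraction in (0, 1].
    """
    weights = softmax(K @ q / np.sqrt(q.shape[0]))
    order = np.argsort(-weights)            # tokens sorted by descending weight
    cumulative = np.cumsum(weights[order])
    # First position at which the running sum reaches p.
    k = int(np.searchsorted(cumulative, p, side="left")) + 1
    k = min(k, len(order))                  # guard against rounding when p is 1.0
    return order[:k], weights


# Synthetic data (assumption): one sharply peaked query and one diffuse query.
rng = np.random.default_rng(0)
n_tokens, head_dim, target = 1024, 64, 0.9
K = rng.standard_normal((n_tokens, head_dim))

sharp_q = 2.0 * K[7]                        # strongly aligned with a single key
diffuse_q = 0.1 * rng.standard_normal(head_dim)

sharp_idx, sharp_w = select_tokens_by_cumulative_score(sharp_q, K, target)
diffuse_idx, diffuse_w = select_tokens_by_cumulative_score(diffuse_q, K, target)

print("tokens selected (sharp query):  ", len(sharp_idx))
print("tokens selected (diffuse query):", len(diffuse_idx))
```

With the same target, the peaked query needs far fewer tokens than the diffuse one, which is the variation in sparsity that a single fixed token budget cannot capture. The sketch reads every key to compute the weights, so it shows only what is being targeted, not how to reach it cheaply.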

Paper Structure

This paper contains 32 sections, 8 equations, 12 figures, 6 tables, 1 algorithm.

Figures (12)

  • Figure 1: Comparison between fixed-budget methods and Tactic. Fixed-budget methods may select more tokens than necessary or deviate substantially from the full attention score. In contrast, Tactic dynamically selects tokens to efficiently approximate full attention based on a cumulative attention score, accounting for the variation in sparsity across query tokens and contexts.
  • Figure 2: The distribution of $||V||$ across different layers, heads, and decoding tokens. The results indicate that $||V||$ values are concentrated within a very narrow range.
  • Figure 3: Comparison of KL-Divergence with attention distance (a) and its relation with downstream task scores (b).
  • Figure 4: Variation in sparsity across attention heads (a), model layers (b), and query tokens (c).
  • Figure 5: Distance between the attention output and full attention for Quest (Tang et al., 2024) and for a cumulative attention score threshold $P$, measured with the Llama3.1-8B-Instruct model. Each dot represents the distance for one attention computation, identified by (head index, layer index, decode step).
  • ...and 7 more figures