Tactic: Adaptive Sparse Attention with Clustering and Distribution Fitting for Long-Context LLMs
Kan Zhu, Tian Tang, Qinyu Xu, Yile Gu, Zhichen Zeng, Rohan Kadekodi, Liangyu Zhao, Ang Li, Arvind Krishnamurthy, Baris Kasikci
TL;DR
Tactic tackles the decode-time bottleneck in long-context LLMs by replacing fixed token budgets with a sparsity-adaptive mechanism that targets a cumulative attention score $P$. It combines clustering-based sorting of Key-vectors with distribution fitting of attention scores to efficiently identify a minimal token subset that achieves the desired attention coverage, while supporting Grouped Query Attention and leveraging FlashInfer for fast execution. The approach yields strong accuracy, with tighter attention-distance bounds, and substantial speedups: up to $7.29\times$ decode speedup and $1.58\times$ end-to-end, across multiple models and long-context benchmarks. This work provides a practical, calibration-free path to scalable long-context inference in accuracy-sensitive applications by adapting token budgets to the actual sparsity patterns of attention. The combination of intrinsic sparsity analysis, adaptive token budgeting, and low-overhead clustering-based sorting makes Tactic broadly applicable to real-world long-context inference scenarios.
Abstract
Long-context models are essential for many applications but face inefficiencies in loading large KV caches during decoding. Prior methods enforce fixed token budgets for sparse attention, assuming a set number of tokens can approximate full attention. However, these methods overlook variations in the importance of attention across heads, layers, and contexts. To address these limitations, we propose Tactic, a sparsity-adaptive and calibration-free sparse attention mechanism that dynamically selects tokens based on their cumulative attention scores rather than a fixed token budget. By setting a target fraction of total attention scores, Tactic ensures that token selection naturally adapts to variations in attention sparsity. To efficiently approximate this selection, Tactic leverages clustering-based sorting and distribution fitting, allowing it to accurately estimate token importance with minimal computational overhead. We show that Tactic outperforms existing sparse attention algorithms, achieving superior accuracy and up to 7.29x decode attention speedup. This improvement translates to an overall 1.58x end-to-end inference speedup, making Tactic a practical and effective solution for long-context LLM inference in accuracy-sensitive applications.
