vAttention: Verified Sparse Attention
Aditya Desai, Kumar Krishna Agrawal, Shuo Yang, Alejandro Cuadron, Luis Gaspar Schroeder, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica
TL;DR
vAttention targets the decoding-time bottleneck of scaled dot-product attention in long-context settings with a verified sparse-attention mechanism that offers user-specified $(\epsilon, \delta)$ guarantees. It combines deterministic heavy-hitter selection (attention sinks, local windows, and predicted top-$k$ tokens) with uniform random sampling of the residual tokens, and derives a CLT-based sampling budget so that $\|\hat{N}/\hat{D} - N/D\|_2 \le \epsilon \|N/D\|_2$ holds with probability at least $1-\delta$, where $N$ and $D$ are the numerator and denominator of the softmax attention output and $\hat{N}$, $\hat{D}$ are their sparse estimates. A denominator-only relaxation keeps the budget computation efficient. Empirically, vAttention improves quality substantially over prior sparse-attention methods, supports long generation at full-model quality under high sparsity (up to ~20x), and delivers practical decoding speedups. Its explicit per-head accuracy guarantees expose a tunable quality-efficiency trade-off, and the code is open source.
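The combination of deterministic heavy hitters and uniformly sampled residual tokens can be illustrated with a minimal NumPy sketch for a single query and head. This is an illustration under stated assumptions, not the authors' implementation: the function names `hybrid_attention` and `full_attention` are hypothetical, the heavy-hitter indices are passed in by the caller (standing in for sinks, the local window, and predicted top-$k$ tokens), and the sample size is given rather than derived from $(\epsilon, \delta)$.

```python
import numpy as np


def full_attention(q, K, V):
    """Reference dense attention for one query: N / D."""
    logits = K @ q / np.sqrt(K.shape[1])
    w = np.exp(logits - logits.max())
    return (w @ V) / w.sum()


def hybrid_attention(q, K, V, heavy_idx, n_samples, rng):
    """Estimate N / D from heavy tokens plus a uniform sample of the rest.

    heavy_idx : indices always attended to (assumed unique).
    n_samples : number of residual tokens drawn without replacement.
    """
    n, d = K.shape
    heavy_idx = np.asarray(heavy_idx, dtype=int)

    # Residual tokens are everything not selected deterministically.
    mask = np.ones(n, dtype=bool)
    mask[heavy_idx] = False
    residual_idx = np.flatnonzero(mask)

    m = min(n_samples, residual_idx.size)
    sample_idx = rng.choice(residual_idx, size=m, replace=False)

    # Only the selected keys are scored, which is where the savings come from.
    idx = np.concatenate([heavy_idx, sample_idx])
    logits = K[idx] @ q / np.sqrt(d)
    w = np.exp(logits - logits.max())  # the shift cancels in the ratio

    # Each sampled token stands in for (residual size / m) residual tokens,
    # so the sampled parts of N-hat and D-hat are unbiased for the residual sums.
    if m > 0:
        w[heavy_idx.size:] *= residual_idx.size / m

    num = w @ V[idx]  # N-hat
    den = w.sum()     # D-hat
    return num / den
```

When `n_samples` covers the whole residual set, the scaling factor is 1 and the estimate coincides with dense attention; smaller sample sizes trade accuracy for fewer key and value reads.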
Abstract
State-of-the-art sparse attention methods for reducing decoding latency fall into two main categories: approximate top-$k$ (and its extension, top-$p$) and recently introduced sampling-based estimation. However, these approaches are fundamentally limited in their ability to approximate full attention: they fail to provide consistent approximations across heads and query vectors and, most critically, lack guarantees on approximation quality, limiting their practical deployment. We observe that top-$k$ and random sampling are complementary: top-$k$ performs well when attention scores are dominated by a few tokens, whereas random sampling provides better estimates when attention scores are relatively uniform. Building on this insight and leveraging the statistical guarantees of sampling, we introduce vAttention, the first practical sparse attention mechanism with user-specified $(\epsilon, \delta)$ guarantees on approximation accuracy (hence "verified"). These guarantees make vAttention a compelling step toward practical, reliable deployment of sparse attention at scale. By unifying top-$k$ and sampling, vAttention outperforms both individually, delivering a superior quality-efficiency trade-off. Our experiments show that vAttention significantly improves the quality of sparse attention (e.g., $\sim$4.5 percentage points for Llama-3.1-8B-Inst and DeepSeek-R1-Distill-Llama-8B on RULER-HARD) and effectively bridges the gap between full and sparse attention (e.g., across datasets, it matches full model quality at up to 20x sparsity). We also demonstrate that it can be deployed in reasoning scenarios to achieve fast decoding without compromising model quality (e.g., vAttention achieves full model quality on AIME2024 at 10x sparsity with generations of up to 32K tokens). Code is open-sourced at https://github.com/xAlg-ai/sparse-attention-hub.
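To make the role of $(\epsilon, \delta)$ concrete, the following sketch shows how a sample budget could be chosen with a generic central-limit-theorem argument applied to the softmax denominator alone. It is a textbook sample-size calculation under stated assumptions, not the paper's exact derivation: the function name `clt_sample_budget`, the use of a pilot sample to estimate the mean and standard deviation of the residual weights, and the two-sided normal quantile are all illustrative choices.

```python
import math
from statistics import NormalDist

import numpy as np


def clt_sample_budget(heavy_weights, pilot_weights, n_residual, eps, delta):
    """Residual samples needed so that |D-hat - D| <= eps * D with
    probability roughly 1 - delta, under a normal approximation.

    heavy_weights : exp(score) for the deterministically selected tokens.
    pilot_weights : exp(score) for a small uniform pilot sample of residual
                    tokens, used only to estimate mean and spread.
    n_residual    : number of residual (non-heavy) tokens.
    """
    z = NormalDist().inv_cdf(1.0 - delta / 2.0)       # two-sided quantile
    sigma = float(np.std(pilot_weights, ddof=1))      # spread of residual weights
    d_hat = float(np.sum(heavy_weights)) + n_residual * float(np.mean(pilot_weights))

    # With m uniform samples the residual-sum estimate has standard deviation
    # about n_residual * sigma / sqrt(m); require z times that <= eps * D.
    m = math.ceil((z * n_residual * sigma / (eps * d_hat)) ** 2)
    return max(1, min(m, n_residual))
```

The formula reflects the complementarity noted in the abstract: when the heavy hitters carry most of the mass, `d_hat` is large relative to the residual spread and few samples are needed, while near-uniform residual weights have small `sigma` and are also cheap to estimate. Tightening `eps` or `delta` raises the budget.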
