vAttention: Verified Sparse Attention

Aditya Desai, Kumar Krishna Agrawal, Shuo Yang, Alejandro Cuadron, Luis Gaspar Schroeder, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica

TL;DR

vAttention tackles the decoding-time bottleneck of Scaled Dot Product Attention in long-context settings with a verified sparse-attention mechanism that offers user-specified $(\epsilon, \delta)$ guarantees. It combines deterministic heavy-hitter selection (sinks, local windows, and predicted top-$k$ tokens) with uniform random sampling of the residual tokens, and derives a CLT-based sampling budget to ensure $\|\hat{N}/\hat{D} - N/D\|_2 \le \epsilon \|N/D\|_2$ with probability $1-\delta$, while using a denominator-only relaxation for efficiency. Empirically, vAttention yields substantial quality gains over prior sparse-attention methods, matches full-model quality in long generation at high sparsity (up to ~20x), and delivers practical speedups. It provides explicit per-head accuracy guarantees and a tunable quality-efficiency trade-off, and the code is open source.
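The combination of deterministic heavy hitters and uniformly sampled residual tokens can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names and the parameters `n_sink`, `n_local`, `top_k`, and `budget` are hypothetical, the top-$k$ tokens are picked from exact scores (an oracle standing in for a predictor), and the sampling budget is passed in by the caller rather than derived from $(\epsilon, \delta)$.

```python
import math
import random

def _weights(q, K):
    """Unnormalized softmax weights exp(q . k_i), shifted by the max logit for stability."""
    logits = [sum(a * b for a, b in zip(q, k)) for k in K]
    m = max(logits)
    return [math.exp(x - m) for x in logits]

def full_attention(q, K, V):
    """Exact attention output N / D for a single query."""
    w = _weights(q, K)
    D = sum(w)
    return [sum(w[i] * V[i][j] for i in range(len(K))) / D for j in range(len(V[0]))]

def sparse_attention(q, K, V, n_sink, n_local, top_k, budget, rng):
    """Heavy hitters are summed exactly; the residual is estimated from a uniform sample."""
    n, d_v = len(K), len(V[0])
    # Assumption: oracle weights. A real system would score only the selected tokens.
    w = _weights(q, K)
    # Deterministic part: attention sinks (first tokens) and the local window (last tokens).
    heavy = set(range(min(n_sink, n))) | set(range(max(0, n - n_local), n))
    # Oracle top-k among the remaining tokens.
    rest = sorted((i for i in range(n) if i not in heavy), key=lambda i: w[i], reverse=True)
    heavy |= set(rest[:top_k])
    # Residual tokens are sampled uniformly without replacement.
    residual = [i for i in range(n) if i not in heavy]
    b = min(budget, len(residual))
    sample = rng.sample(residual, b)
    scale = len(residual) / b if b else 0.0  # rescale the sample sum to the residual size
    D_hat = sum(w[i] for i in heavy) + scale * sum(w[i] for i in sample)
    N_hat = [
        sum(w[i] * V[i][j] for i in heavy) + scale * sum(w[i] * V[i][j] for i in sample)
        for j in range(d_v)
    ]
    return [x / D_hat for x in N_hat]
```

When `budget` covers the whole residual, the sample contains every residual token once, the scale factor is 1, and the sketch reduces to exact attention. Smaller budgets trade accuracy for fewer key-value reads.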

Abstract

State-of-the-art sparse attention methods for reducing decoding latency fall into two main categories: approximate top-$k$ (and its extension, top-$p$) and recently introduced sampling-based estimation. However, these approaches are fundamentally limited in their ability to approximate full attention: they fail to provide consistent approximations across heads and query vectors and, most critically, lack guarantees on approximation quality, limiting their practical deployment. We observe that top-$k$ and random sampling are complementary: top-$k$ performs well when attention scores are dominated by a few tokens, whereas random sampling provides better estimates when attention scores are relatively uniform. Building on this insight and leveraging the statistical guarantees of sampling, we introduce vAttention, the first practical sparse attention mechanism with user-specified $(\epsilon, \delta)$ guarantees on approximation accuracy (thus, verified). These guarantees make vAttention a compelling step toward practical, reliable deployment of sparse attention at scale. By unifying top-$k$ and sampling, vAttention outperforms both individually, delivering a superior quality-efficiency trade-off. Our experiments show that vAttention significantly improves the quality of sparse attention (e.g., $\sim$4.5 percentage points for Llama-3.1-8B-Inst and DeepSeek-R1-Distill-Llama-8B on RULER-HARD), and effectively bridges the gap between full and sparse attention (e.g., across datasets, it matches full model quality with up to 20x sparsity). We also demonstrate that it can be deployed in reasoning scenarios to achieve fast decoding without compromising model quality (e.g., vAttention achieves full model quality on AIME2024 at 10x sparsity with up to 32K token generations). Code is open-sourced at https://github.com/xAlg-ai/sparse-attention-hub.
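The complementarity of top-$k$ and random sampling can be illustrated on the softmax denominator alone. The toy below is a sketch with made-up weight profiles (`flat` and `peaked` are hypothetical, not data from the paper). It compares the fraction of total mass captured by the $k$ largest weights with a rescaled uniform-sampling estimate of the same total.

```python
import math
import random

def topk_mass(weights, k):
    """Fraction of the total mass captured by the k largest weights."""
    return sum(sorted(weights, reverse=True)[:k]) / sum(weights)

def sampled_mass(weights, b, rng):
    """Uniform-sampling estimate of the total mass, as a ratio to the true total."""
    n = len(weights)
    picks = rng.sample(range(n), b)  # b distinct indices, uniformly at random
    return (n / b) * sum(weights[i] for i in picks) / sum(weights)

n, budget = 1000, 50
flat = [1.0] * n                                  # relatively uniform attention
peaked = [math.exp(10.0)] * 5 + [1.0] * (n - 5)   # a few dominant tokens

rng = random.Random(0)
flat_topk = topk_mass(flat, budget)              # misses most of the mass
flat_sampled = sampled_mass(flat, budget, rng)   # recovers the total
peaked_topk = topk_mass(peaked, budget)          # captures nearly all of the mass
peaked_sampled = sampled_mass(peaked, budget, rng)  # depends on hitting a dominant token
```

On the flat profile, top-$k$ with a budget of 50 out of 1000 tokens sees only 5% of the mass, while the sampled estimate is exact because every weight is identical. On the peaked profile the roles reverse: top-$k$ captures over 99% of the mass, and the sampled estimate swings far from the truth depending on whether a dominant token happens to be drawn.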

Paper Structure

This paper contains 32 sections, 7 theorems, 24 equations, 16 figures, 10 tables, 2 algorithms.

Key Result

Lemma 4.1

Let $\mathbf{s} = \sum_{i=1}^{n_s} \mathbf{r}_i$, $\mathbf{s} \in \mathbb{R}^d$, be a sum of $n_s$ vector quantities $\mathbf{r}_i \in \mathbb{R}^d \; \forall i$, which has to be estimated using a sample $\mathcal{I}_b$ of size $b$. Let $\Sigma$ be the covariance matrix for the population $\{\mathbf{r}_i\}_{i=1}^{n_s}$. for any arbitrary $\tau \in \mathbb{R}$ and $\delta \in (0, 1)$.
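The setting of the lemma, estimating a vector sum from a sample of size $b$, can be sketched in a few lines. The budget rule below is an assumption and is not the lemma's bound: it swaps in a Chebyshev-type argument. For uniform sampling with replacement, $E\|\hat{\mathbf{s}} - \mathbf{s}\|_2^2 = n_s^2 \, \mathrm{Tr}(\Sigma) / b$, so Markov's inequality gives $P(\|\hat{\mathbf{s}} - \mathbf{s}\|_2 \ge \tau) \le n_s^2 \, \mathrm{Tr}(\Sigma) / (b \tau^2)$. A CLT-based budget, as used by vAttention, would typically be less conservative. The function names are hypothetical, and $\mathrm{Tr}(\Sigma)$ is computed from the full population here, whereas a practical system would have to estimate it.

```python
import math
import random

def trace_cov(R):
    """Trace of the population covariance matrix of the rows of R."""
    n, d = len(R), len(R[0])
    mean = [sum(r[j] for r in R) / n for j in range(d)]
    return sum((r[j] - mean[j]) ** 2 for r in R for j in range(d)) / n

def chebyshev_budget(R, tau, delta):
    """Smallest b with n_s^2 * Tr(Sigma) / (b * tau^2) <= delta (assumed rule, see lead-in)."""
    n = len(R)
    return max(1, math.ceil(n * n * trace_cov(R) / (delta * tau * tau)))

def estimate_sum(R, b, rng):
    """Unbiased estimate of the sum of all rows from b uniform draws with replacement."""
    n, d = len(R), len(R[0])
    idx = [rng.randrange(n) for _ in range(b)]
    return [(n / b) * sum(R[i][j] for i in idx) for j in range(d)]
```

If the returned budget exceeds $n_s$, computing the sum exactly is cheaper than sampling. A population with zero covariance needs only a single sample under this rule.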

Figures (16)

  • Figure 1: [Left] vAttention accepts a user tolerance parameter $\epsilon$ and ensures that sparse attention errors are controlled to be within this tolerance. [Middle] vAttention achieves a SOTA trade-off, outperforming leading methods like HashAttention and even a strong oracle top-$p$ on a mix of long-context benchmarks (RULER32K, LongBench, Loogle). [Right] There is a strong correlation between the approximation error in the layer attention output and the user-defined parameter $\epsilon$ accepted by vAttention with verified denominator-only approximation, validating the practical relevance of the $\epsilon$ parameter.
  • Figure 2: Top: cumulative sum of attention scores sorted in descending order of magnitude, showing the number of tokens required to achieve a coverage $p \in (0,1)$ over the scores. Bottom: relative local attention errors across token budgets, indexed by head $h$ and query index $q$.
  • Figure 3: vAttention composes sink, sliding-window, and approximate top-$k$ based attention with random-sampling-based selection, whose budget is governed by an adaptive sampling module that ensures the user-specified $(\epsilon, \delta)$ guarantees hold for each attention head in every layer. The index computation and budget computation occur entirely on the GPU, and the final attention computation can retrieve the KV cache from either the GPU or the CPU, depending on its location.
  • Figure 4: Pareto curves (Quality and Error vs. Density) for different baselines and their combination with vAttention across different datasets/benchmarks for the Llama-3.1-8B-Instruct model. More Pareto results are in the appendix.
  • Figure 5: For Llama models with the KV cache hosted on the CPU, we observe a near-linear speedup, as inference is memory-bound and latency primarily depends on the amount of KV cache read. This experiment is conducted using naive PyTorch code for index computation, and the results can be further improved with a dedicated CUDA implementation.
  • ...and 11 more figures
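
The coverage quantity plotted in the top panel of Figure 2 (how many of the largest attention scores are needed to reach a fraction $p$ of the total) can be computed as in the sketch below. The function name `tokens_for_coverage` is hypothetical and the example scores are made up for illustration.

```python
def tokens_for_coverage(scores, p):
    """Smallest number of top-scoring tokens whose scores sum to at least p of the total."""
    total = sum(scores)
    running = 0.0
    # Walk the scores from largest to smallest, accumulating mass.
    for count, s in enumerate(sorted(scores, reverse=True), start=1):
        running += s
        if running >= p * total:
            return count
    return len(scores)

peaked = [8.0, 4.0, 2.0, 1.0, 1.0]   # a few tokens dominate
uniform = [1.0] * 10                  # mass spread evenly

peaked_half = tokens_for_coverage(peaked, 0.5)    # 1 token suffices
uniform_half = tokens_for_coverage(uniform, 0.5)  # 5 tokens are needed
```

The contrast between the two profiles is the point of the figure: the token count needed for a fixed coverage varies widely with how concentrated the scores are, so a single fixed top-$k$ budget cannot suit every head and query.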

Theorems & Definitions (7)

  • Lemma 4.1: Estimating vector sum
  • Lemma 4.2
  • Theorem 4.3: $(\epsilon, \delta)$ verified-$\mathrm{SDPA}(K, V, q)$
  • Lemma D.1: Estimating vector sum
  • Corollary D.2: $(\epsilon, \delta)$ approximation of $N$
  • Corollary D.3: $(\epsilon, \delta)$ approximation of $D$
  • Lemma D.4