Attention in Constant Time: Vashista Sparse Attention for Long-Context Decoding with Exponential Guarantees

Vashista Nobaub

TL;DR

This work introduces a geometric theory of sparse long-context decoding for attention by modeling the attention mechanism as a projection onto the convex hull of key vectors and applying entropic regularization. A face-stability main theorem shows that, with a positive face gap $\Delta(q)>0$, mass concentrates on a small active face and leakage off-face decays exponentially as $\exp(-\Delta(q)/(2\varepsilon))$, enabling constant-time per-token decoding. The authors present Vashista Sparse Attention, a practical drop-in mechanism with paging-based context selection that achieves $O(PD+K_cD)$ decode complexity, where $P$ is the routed page count and $K_c$ the candidate set size, independent of context length $T$ under the stability regime. They provide diagnostics to detect the presence of a gap, validate the approach experimentally in industrial-style long-context serving settings, and outline deployment guidance for privacy-conscious and air-gapped environments, with implications for retrieval-augmented generation and long-horizon agents. Overall, the paper offers a principled framework and engineering pathway to reliable, scalable long-context decoding with provable guarantees and practical tooling for production deployments.
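To make the $O(PD+K_cD)$ decode cost concrete, here is a minimal sketch of one paged decode step. It is an illustration under stated assumptions, not the paper's implementation: the names `paged_sparse_decode`, `PAGES`, and `SUMMARIES` are hypothetical, each page is assumed to be routed through a single summary vector, and plain softmax over dot-product scores stands in for the entropic attention. How $P$ is kept bounded as the context grows is not specified here.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores, eps):
    # Numerically stable softmax at temperature eps.
    m = max(scores)
    w = [math.exp((s - m) / eps) for s in scores]
    z = sum(w)
    return [x / z for x in w]

def paged_sparse_decode(q, page_summaries, pages, k_c, eps=1.0):
    """One decode step over P routed pages and at most k_c >= 1 candidate tokens.

    page_summaries: P vectors of dimension D (hypothetical routing keys).
    pages: P lists of (key, value) pairs.
    """
    # Routing: P dot products of dimension D, i.e. O(P*D) work
    # (the sort adds O(P log P), which does not depend on D).
    order = sorted(range(len(pages)),
                   key=lambda p: dot(q, page_summaries[p]), reverse=True)
    # Candidate gathering: fill up to k_c tokens from the best-scoring pages.
    cand = []
    for p in order:
        room = k_c - len(cand)
        if room <= 0:
            break
        cand.extend(pages[p][:room])
    # Attention restricted to the candidates: O(k_c*D) work.
    weights = softmax([dot(q, k) for k, _ in cand], eps)
    d_v = len(cand[0][1])
    out = [sum(w * v[j] for w, (_, v) in zip(weights, cand)) for j in range(d_v)]
    return out, len(cand)

# Toy data (hypothetical): two pages of two tokens each, D = 2.
PAGES = [
    [((1.0, 0.0), (1.0, 0.0)), ((0.9, 0.0), (1.0, 0.0))],
    [((-1.0, 0.0), (0.0, 1.0)), ((-1.0, 0.1), (0.0, 1.0))],
]
SUMMARIES = [(0.95, 0.0), (-1.0, 0.05)]

out, n = paged_sparse_decode([1.0, 0.0], SUMMARIES, PAGES, k_c=2)
print(n, out)
```

Neither loop touches more than $P$ summaries or $K_c$ tokens, so the per-token cost in this sketch does not depend on the total number of cached tokens once $P$ and $K_c$ are fixed.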

Abstract

Large language models spend most of their inference cost on attention over long contexts, yet empirical behavior suggests that only a small subset of tokens meaningfully contributes to each query. We formalize this phenomenon by modeling attention as a projection onto the convex hull of key vectors and analyzing its entropic (softmax-like) relaxation. Our main theoretical contribution is a face-stability theorem showing that, under a strict complementarity margin (a support gap $\Delta$ certified by KKT multipliers), entropic attention concentrates on a constant-size active face: the total mass assigned to inactive tokens decays exponentially as $\exp(-\Omega(\Delta/\varepsilon))$, while the error on the active face scales linearly in the temperature/regularization parameter $\varepsilon$. This yields a practical criterion for when sparse long-context decoding is safe and provides a principled knob to trade accuracy for compute. Building on these guarantees, we introduce Vashista Sparse Attention, a drop-in mechanism that maintains a small candidate set per query through a paging-style context selection strategy compatible with modern inference stacks. Across long-context evaluations, we observe stable constant-size effective support, strong wall-clock speedups, and minimal quality degradation in the regimes predicted by the support-gap diagnostics. Finally, we discuss deployment implications for privacy-sensitive and air-gapped settings, where interchangeable attention modules enable predictable latency and cost without external retrieval dependencies.
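The exponential leakage claim can be checked numerically on a toy instance. The sketch below is an assumption-laden illustration rather than the paper's algorithm: it takes the entropic relaxation to be $\min_{\alpha}\ \tfrac12\|U\alpha-q\|^2+\varepsilon\sum_i \alpha_i\log\alpha_i$ over the simplex, solves it with exponentiated-gradient (entropic mirror descent) updates, and uses three hypothetical keys in $\mathbb{R}^2$. For this instance the unregularized projection puts weight $(\tfrac12,\tfrac12,0)$ on the keys, and the off-face multiplier works out to $1.5$ under the usual KKT sign convention, which is used here as the gap $\Delta$.

```python
import math

# Toy instance (hypothetical): three keys in R^2 and a query outside their hull.
KEYS = [(1.0, 0.0), (0.0, 1.0), (-1.0, -1.0)]
QUERY = (1.0, 1.0)
GAP = 1.5  # off-face KKT multiplier of the third key for this instance

def entropic_projection(keys, q, eps, step=0.25, iters=3000):
    """Minimise 0.5*||U a - q||^2 + eps * sum_i a_i log a_i over the simplex
    using exponentiated-gradient updates, starting from uniform weights."""
    m, d = len(keys), len(q)
    a = [1.0 / m] * m
    for _ in range(iters):
        # Residual r = U a - q.
        r = [sum(a[i] * keys[i][j] for i in range(m)) - q[j] for j in range(d)]
        # Gradient u_i . r of the quadratic plus eps * log a_i from the entropy
        # term (its additive constant cancels when normalising).
        grad = [sum(u * x for u, x in zip(keys[i], r)) + eps * math.log(a[i])
                for i in range(m)]
        w = [a[i] * math.exp(-step * grad[i]) for i in range(m)]
        z = sum(w)
        a = [x / z for x in w]
    return a

leaks = {}
for eps in (0.5, 0.25, 0.1):
    alpha = entropic_projection(KEYS, QUERY, eps)
    leaks[eps] = alpha[2]  # mass on the key outside the active face
    print(f"eps={eps}: off-face mass={leaks[eps]:.3e}, "
          f"exp(-gap/(2*eps))={math.exp(-GAP / (2 * eps)):.3e}")
```

In this example the mass on the third key shrinks rapidly as $\varepsilon$ decreases and stays below $\exp(-\Delta/(2\varepsilon))$, consistent with the rate quoted in the TL;DR; the paper's own normalization of $\Delta$ may differ from the one assumed here.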

Paper Structure (85 sections, 15 theorems, 78 equations, 2 figures, 3 tables)

Key Result

Lemma 1

Let $U=[u_1,\dots,u_M]\in\mathbb{R}^{d\times M}$ and $K=\mathrm{conv}\{u_1,\dots,u_M\}$. For a query $q\in\mathbb{R}^d$, consider the (unregularized) projection written in simplex form There exist multipliers $\nu^{\ast}\in\mathbb{R}$ and $\mu^{\ast}\in\mathbb{R}^M_{+}$ such that Let $I=\{i:\alpha_i^{\ast}>0\}$ denote the active set (the indices of the optimal face). Then $\mu_i^{\ast}=0$ for al
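The displayed formulas of the lemma are missing from the statement as given here. Assuming the standard Euclidean projection onto $K$ and the usual sign convention for KKT multipliers (both are assumptions, not confirmed by this page), the simplex form and the optimality conditions would plausibly read:

```latex
\alpha^{\ast} \in \arg\min_{\alpha \in \mathbb{R}^M_{+},\ \mathbf{1}^{\top}\alpha = 1}
  \tfrac{1}{2}\,\lVert U\alpha - q \rVert_2^2 ,
\qquad
U^{\top}\bigl(U\alpha^{\ast} - q\bigr) + \nu^{\ast}\mathbf{1} - \mu^{\ast} = 0 ,
\qquad
\mu_i^{\ast}\,\alpha_i^{\ast} = 0 \quad (i = 1,\dots,M).
```

Under this reading, complementary slackness forces $\mu_i^{\ast}=0$ on the active set $I$, and the off-face multipliers $\mu_j^{\ast}$, $j\notin I$, are the quantities a support gap would be measured from.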

Figures (2)

  • Figure 1: Decode attention time scaling (dense vs. sparse) as context length increases.
  • Figure 2: Sparse decode attention time heatmaps over $(P,K_c)$ for EG (left) and Frank-Wolfe (right).

Theorems & Definitions (30)

  • Definition 1: Face gap / strict complementarity margin
  • Definition 2: Tangent conditioning on the active face
  • Lemma 1: Multiplier identity and support gap
  • Corollary 1: Exponential off-face leakage for entropic attention
  • Theorem 1: Face stability with exponential leakage
  • Remark 1: Interpretation
  • Corollary 2: Off-face mass is exponentially small
  • Theorem 2: Vashista's deterministic theorem: tangent bias + exponential leakage (consolidated)
  • Corollary 3: Implementation complexity: $O(PD + K_cD)$ decode
  • Lemma 2: Support gap equals the minimum off-face multiplier
  • ...and 20 more