Scout Before You Attend: Sketch-and-Walk Sparse Attention for Efficient LLM Inference

Hoang Anh Duy Le, Sahil Joshi, Zeyu Yang, Zhaozhuo Xu, Anshumali Shrivastava

TL;DR

This work tackles the quadratic cost of self-attention in long-context LLM inference, arguing that per-layer one-hop sparsification misses multi-hop dependencies that emerge through attention composition. It introduces Sketch&Walk, a training-free method that combines Small-World Sketching, which cheaply estimates block-level interactions, with a Sketch-Determined Walk, which accumulates cross-layer influence to select the top-$\tau$ blocks for sparse attention in both prefill and decode. Theoretical guarantees show that, under the paper's assumptions, the top-$\tau$ blocks identified by the walk recover the essential attention structure and that the resulting outputs approximate full attention within provable bounds. Empirically, it achieves near-lossless accuracy at $80\%$ sparsity and up to $6\times$ end-to-end speedups on long-context benchmarks, with speedups that grow with context length, across models such as Llama-3 and Qwen2.

Abstract

Self-attention dominates the computational and memory cost of long-context LLM inference across both prefill and decode phases. To address this challenge, we introduce Sketch&Walk Attention, a training-free sparse attention method that determines sparsity with lightweight sketches and a deterministic walk. Sketch&Walk applies Hadamard sketching to obtain inexpensive approximations of attention scores, then aggregates these estimates across layers via a walk mechanism that captures attention influence beyond direct interactions between tokens. The accumulated walk scores are used to select top-k attention blocks, enabling dynamic sparsity with a single training-free algorithm that applies uniformly to both the prefill and decode phases, together with custom sparse attention kernels. Across a wide range of models and tasks, Sketch&Walk maintains near-lossless accuracy at 20% attention density and can slightly outperform dense attention in some settings, while achieving up to 6x inference speedup.
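
The walk-then-select idea in the abstract can be illustrated with a toy example. This is a minimal sketch, not the paper's algorithm: it assumes the per-layer block estimates are already available as row-stochastic matrices, and it models the walk as a plain matrix product across layers. The matrices `layer1` and `layer2` and the helpers `walk_scores` and `top_k_blocks` are hypothetical names chosen for illustration.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def walk_scores(layer_scores):
    """Compose per-layer block scores into cumulative walk scores.

    Assumption: the walk is modeled as a product of row-stochastic
    block-score matrices, with each newer layer multiplied on the left.
    """
    walk = layer_scores[0]
    for scores in layer_scores[1:]:
        walk = matmul(scores, walk)
    return walk


def top_k_blocks(row, k):
    """Return the indices of the k largest entries, highest score first."""
    order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
    return order[:k]


# Hypothetical causal block scores for two layers (each row sums to 1).
layer1 = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.8, 0.2],
]
layer2 = [
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.1, 0.6, 0.3],
]

walk = walk_scores([layer1, layer2])

# One-hop selection looks only at the current layer's scores.
one_hop = top_k_blocks(layer2[2], 1)
# Walk-based selection uses the influence accumulated across layers.
multi_hop = top_k_blocks(walk[2], 1)

print("one-hop choice for block 2:", one_hop)
print("walk-based choice for block 2:", multi_hop)
```

In this toy setup the last query block attends mostly to block 1 in the current layer, but block 1 itself drew most of its content from block 0 in the previous layer, so the accumulated scores rank block 0 first. That is the kind of indirect dependency a per-layer, one-hop selection would miss.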

Paper Structure

This paper contains 22 sections, 15 theorems, 68 equations, 6 figures, 7 tables, 2 algorithms.

Key Result

Lemma 2.3

(Token-space Sketching: Subspace Embedding via Block Averaging). Under the block coherence assumption, the block average $\overline{\mathbf{q}}_i = \frac{1}{B}\sum_{t=1}^{B} \mathbf{q}_t^{(i)}$ satisfies: with probability at least $1 - \delta$.
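
The block average in the lemma replaces the $B$ query vectors of a block with a single vector. The snippet below is a small sketch of that operation on made-up data. It also checks one fact that follows from linearity alone: the dot product of two block averages equals the mean of all pairwise dot products between the blocks. This concerns raw (pre-softmax) scores only and is not the probabilistic bound the lemma states. The vectors and helper names are hypothetical.

```python
def block_average(vectors):
    """Average a block of equal-length vectors component-wise."""
    count = len(vectors)
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / count for d in range(dim)]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical block of B = 2 queries and B = 2 keys in 2 dimensions.
query_block = [[1.0, 2.0], [3.0, 0.0]]
key_block = [[0.5, 1.0], [1.5, -1.0]]

q_bar = block_average(query_block)
k_bar = block_average(key_block)

# Block-level estimate: one dot product instead of B * B of them.
sketched = dot(q_bar, k_bar)

# Reference value: mean of all token-level dot products in the block pair.
pairwise = [dot(q, k) for q in query_block for k in key_block]
exact_mean = sum(pairwise) / len(pairwise)

print("block-average score:", sketched)
print("mean pairwise score:", exact_mean)
```

The two printed values agree, which is why a single averaged vector per block can stand in for the block when only an aggregate score is needed. How well the averaged vector represents individual tokens is what the coherence assumption governs.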

Figures (6)

  • Figure 1: Overview of Sketch&Walk. (1) Queries and keys are sketched with Small-World Sketching to obtain lightweight block-level attention estimates. (2) These estimates are accumulated across layers with Sketch-Determined Walk to approximate cross-layer attention influence. (3) The resulting walk scores are used to select top-$\tau$ blocks for sparse attention.
  • Figure 2: Visualization of attention matrices from a layer of the Llama-3.1-8B-Instruct. Each node corresponds to a token, and edge intensity reflects the magnitude of the attention score. $A^1$ (left) shows direct attention scores. Higher-order compositions of attention are shown by $A^4$ (middle) and $A^7$ (right). While $A^1$ captures only direct interactions, higher powers approximate the influence induced by repeated attention composition, reflecting attention that becomes strong in deeper layers.
  • Figure 3: Prefill Phase Acceleration
  • Figure 4: Decode Phase Acceleration
  • Figure 5: Sketch&Walk Sparse Attention Kernel Analysis
  • ...and 1 more figure

Theorems & Definitions (29)

  • Definition 2.1: Block Attention Score
  • Lemma 2.3
  • Theorem 2.4
  • Lemma 2.4
  • Lemma 2.4
  • Definition A.1: Block Attention Score
  • Lemma A.4: Token-space Sketching: Subspace Embedding via Block Averaging
  • proof
  • Corollary A.5: Inner Product Preservation under Token-space Sketching
  • proof
  • ...and 19 more