Scout Before You Attend: Sketch-and-Walk Sparse Attention for Efficient LLM Inference

Hoang Anh Duy Le; Sahil Joshi; Zeyu Yang; Zhaozhuo Xu; Anshumali Shrivastava

TL;DR

This work tackles the quadratic cost of self-attention in long-context LLM inference, arguing that per-layer one-hop sparsification misses multi-hop dependencies that emerge when attention composes across layers. It introduces Sketch&Walk, a training-free method that combines Small-World Sketching, which cheaply estimates block-level interactions, with a Sketch-Determined Walk, which accumulates cross-layer influence to select the top-$\tau$ blocks for sparse attention in both prefill and decode. The theoretical analysis shows that, under its stated assumptions, the top-$\tau$ blocks identified by the walk recover the essential attention structure and that the resulting outputs approximate full attention within provable bounds. Empirically, the method achieves near-lossless accuracy at $80\%$ sparsity and up to $6\times$ end-to-end speedup on long-context benchmarks, with speedups that grow with context length, across models such as Llama-3 and Qwen2.

Abstract

Self-attention dominates the computational and memory cost of long-context LLM inference across both the prefill and decode phases. To address this challenge, we introduce Sketch&Walk Attention, a training-free sparse attention method that determines sparsity with lightweight sketches and a deterministic walk. Sketch&Walk applies Hadamard sketching to obtain inexpensive approximations of attention scores, then aggregates these estimates across layers via a walk mechanism that captures attention influence beyond direct interactions between tokens. The accumulated walk scores are used to select top-k attention blocks, enabling dynamic sparsity with a single training-free algorithm that applies uniformly to both the prefill and decode phases and is paired with custom sparse attention kernels. Across a wide range of models and tasks, Sketch&Walk maintains near-lossless accuracy at 20% attention density and can slightly outperform dense attention in some settings, while achieving up to 6x inference speedup.
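The three stages named in the abstract (sketch, walk, top-k block selection) can be illustrated with a minimal pure-Python sketch. It is not the paper's algorithm or kernels: the sketch step uses only block averaging (no Hadamard transform), the accumulation rule in `walk_step` (multiplying the current layer's block estimate into the running walk matrix) is an assumption made for illustration, and all function names, sizes, and the random inputs are hypothetical.

```python
import math
import random

def block_means(vectors, block_size):
    """Token-space sketch: average each run of `block_size` consecutive vectors."""
    means = []
    for start in range(0, len(vectors), block_size):
        chunk = vectors[start:start + block_size]
        means.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return means

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def block_attention(q_blocks, k_blocks):
    """Causal, row-normalized block-level attention estimate."""
    n, dim = len(q_blocks), len(q_blocks[0])
    rows = []
    for i, q in enumerate(q_blocks):
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in k_blocks[:i + 1]]
        rows.append(softmax(logits) + [0.0] * (n - i - 1))
    return rows

def walk_step(walk, estimate):
    """ASSUMED accumulation rule: compose this layer's estimate with the running walk."""
    if walk is None:
        return estimate
    n = len(estimate)
    return [[sum(estimate[i][m] * walk[m][j] for m in range(n)) for j in range(n)]
            for i in range(n)]

def top_k_blocks(walk, k):
    """Per query block, keep the k causally visible key blocks with the largest walk score."""
    selected = []
    for i, row in enumerate(walk):
        visible = sorted(range(i + 1), key=lambda j: row[j], reverse=True)
        selected.append(sorted(visible[:k]))
    return selected

def sketch_and_walk(layers, block_size, k):
    """Run sketch -> walk -> select for each layer's (queries, keys) pair."""
    walk, per_layer = None, []
    for queries, keys in layers:
        estimate = block_attention(block_means(queries, block_size),
                                   block_means(keys, block_size))
        walk = walk_step(walk, estimate)
        per_layer.append(top_k_blocks(walk, k))
    return walk, per_layer

# Synthetic stand-in for per-layer queries and keys: 3 layers, 32 tokens, dim 8.
rng = random.Random(0)

def random_tokens(n, dim):
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]

layers = [(random_tokens(32, 8), random_tokens(32, 8)) for _ in range(3)]
walk, per_layer = sketch_and_walk(layers, block_size=4, k=2)
```

With 32 tokens and a block size of 4, `walk` is an 8x8 lower-triangular matrix whose rows sum to 1, and `per_layer[l][i]` lists the key blocks that query block `i` would attend to at layer `l`. A real implementation would pass these indices to a block-sparse attention kernel rather than materializing dense matrices.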

Paper Structure (22 sections, 15 theorems, 68 equations, 6 figures, 7 tables, 2 algorithms)

Introduction
Sketch and Walk
Preliminaries and Notation
Small-World Sketching
Sketch-Determined Walk
Error Bounds and Approximation Analysis
Why Sketch&Walk for Sparse Attention?
Experiments
Settings
Accuracy Evaluation
Efficiency Evaluation
Ablation Studies
Related Works
Conclusion
Theoretical Analysis
...and 7 more sections

Key Result

Lemma 2.3

(Token-space Sketching: Subspace Embedding via Block Averaging). Under the block-coherence assumption, the block average $\overline{\mathbf{q}}_i = \frac{1}{B}\sum_{t=1}^{B} \mathbf{q}_t^{(i)}$ satisfies: with probability at least $1 - \delta$.
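
One elementary identity helps explain why block averages are a natural sketch. It follows from bilinearity of the inner product alone and is not the lemma's probabilistic bound. Assuming a key-block average $\overline{\mathbf{k}}_j = \frac{1}{B}\sum_{s=1}^{B} \mathbf{k}_s^{(j)}$ is defined for keys the same way the lemma defines $\overline{\mathbf{q}}_i$ for queries (the key-side notation is introduced here for illustration):

$$
\overline{\mathbf{q}}_i^{\top}\,\overline{\mathbf{k}}_j
= \Big(\frac{1}{B}\sum_{t=1}^{B} \mathbf{q}_t^{(i)}\Big)^{\top}\Big(\frac{1}{B}\sum_{s=1}^{B} \mathbf{k}_s^{(j)}\Big)
= \frac{1}{B^2}\sum_{t=1}^{B}\sum_{s=1}^{B} \mathbf{q}_t^{(i)\top}\mathbf{k}_s^{(j)}
$$

A single block-level inner product therefore equals the mean of the $B^2$ token-level attention logits between the two blocks, computed at the cost of one dot product. How closely quantities derived from that mean track the token-level attention structure is what the block-coherence assumption and the lemma address.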

Figures (6)

Figure 1: Overview of Sketch&Walk. (1) Queries and keys are sketched with Small-World Sketching to obtain lightweight block-level attention estimates. (2) These estimates are accumulated across layers with Sketch-Determined Walk to approximate cross-layer attention influence. (3) The resulting walk scores are used to select top-$\tau$ blocks for sparse attention.
Figure 2: Visualization of attention matrices from a layer of Llama-3.1-8B-Instruct. Each node corresponds to a token, and edge intensity reflects the magnitude of the attention score. $A^1$ (left) shows direct attention scores. Higher-order compositions of attention are shown by $A^4$ (middle) and $A^7$ (right). While $A^1$ captures only direct interactions, higher powers approximate the influence induced by repeated attention composition, reflecting attention that becomes strong in deeper layers.
Figure 3: Prefill Phase Acceleration
Figure 4: Decode Phase Acceleration
Figure 5: Sketch&Walk Sparse Attention Kernel Analysis
...and 1 more figures
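
The point of Figure 2, that powers of an attention matrix expose influence a single layer does not show, can be checked on a toy case. The sketch below uses a hypothetical 3-token causal attention matrix with invented values: token 2 never attends to token 0 directly, but it attends to token 1, which attends to token 0, so the two-hop entry of $A^2$ is non-zero.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][m] * b[m][j] for m in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def matrix_power(a, n):
    """Compute a**n for a square matrix by repeated multiplication (n >= 0)."""
    size = len(a)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = matmul(result, a)
    return result

# Hypothetical row-stochastic causal attention matrix (row = query, column = key).
A = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.8, 0.2],  # token 2 puts no direct weight on token 0
]

A2 = matrix_power(A, 2)
direct = A[2][0]    # one-hop weight from token 2 to token 0
two_hop = A2[2][0]  # weight reached through token 1: 0.8 * 0.9
```

Here `direct` is zero while `two_hop` is about 0.72, so a selection rule that looks only at one layer's scores would discard token 0 for this query even though it carries most of the composed influence.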

Theorems & Definitions (29)

Definition 2.1: Block Attention Score
Lemma 2.3
Theorem 2.4
Lemma 2.4
Lemma 2.4
Definition A.1: Block Attention Score
Lemma A.4: Token-space Sketching: Subspace Embedding via Block Averaging
proof
Corollary A.5: Inner Product Preservation under Token-space Sketching
proof
...and 19 more
