How Sparse Attention Approximates Exact Attention? Your Attention is Naturally $n^C$-Sparse

Yichuan Deng; Zhao Song; Jing Xiong; Chiwun Yang

TL;DR

This work provides a theoretical framework showing that standard attention is naturally sparse, with the effective number of large entries per row growing only as a sublinear function of input length. It derives concentration bounds for the underlying quantities, introduces the notion of attention collapse, and proves that a sparsity window of size $k=\Omega(n^{C})$ (for any constant $C\in(0,1)$) suffices to approximate exact attention with vanishing error, while $k=o(\log n)$ is insufficient. It then advocates a dynamic Top-$k$ strategy, $k=\alpha n^{C}$, over fixed windows to achieve better accuracy-efficiency trade-offs and validates these insights with empirical simulations and long-context benchmarks. The findings provide concrete guidance for designing sub-quadratic attention mechanisms and understanding when sparse attention can be reliable in practice. Overall, the results offer theoretical justification for adaptive sparsity in attention and suggest robust directions for sparse transformer architectures in long-context settings.

Abstract

Sparse Attention is a technique that approximates standard attention computation with sub-quadratic complexity. This is achieved by selectively ignoring smaller entries in the attention matrix during the softmax function computation. Variations of this technique, such as pruning KV cache, sparsity-based fast attention, and Sparse Transformer, have been extensively utilized for efficient Large Language Models (LLMs) deployment. Despite its widespread use, a theoretical understanding of the conditions under which sparse attention performs on par with traditional attention remains elusive. This work aims to bridge this gap by examining the inherent sparsity of standard attention processes. Our theoretical framework reveals several brand-new key insights:
$\bullet$ Attention is $n^{C}$-sparse, implying that considering only the largest $\Omega(n^{C})$ entries out of all $n$ entries is sufficient for sparse attention to approximate the exact attention matrix with decreasing loss. Here, $n$ represents the input length and $C \in (0, 1)$ is a constant.
$\bullet$ Stable $o(\log(n))$-sparse attention, which approximates attention computation with $\log(n)$ or fewer entries, may not be feasible since the error will persist at a minimum of $O(1)$.
$\bullet$ An adaptive strategy ($\alpha \cdot n^C$, $\alpha \in \mathbb{R}$) for the window size of efficient attention methods rather than a fixed one is guaranteed to perform more accurately and efficiently in a task for inference on flexible context lengths.
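To make the adaptive window concrete, the following is a minimal NumPy sketch of Top-$k$ sparse attention with $k = \lceil \alpha \cdot n^C \rceil$. It is an illustration under stated assumptions, not the authors' implementation: the function names (`adaptive_window`, `exact_attention_matrix`, `topk_attention_matrix`), the $1/\sqrt{d}$ score scaling, and the default values of `alpha` and `C` are all hypothetical choices made for this example.

```python
import math
import numpy as np

def adaptive_window(n, alpha=1.0, C=0.5):
    """Window size k = ceil(alpha * n^C), clipped to the range [1, n].
    alpha and C are illustrative defaults, not values from the paper."""
    return max(1, min(n, math.ceil(alpha * n ** C)))

def _scores(Q, K):
    # Assumption: scaled dot-product scores Q K^T / sqrt(d).
    return Q @ K.T / math.sqrt(Q.shape[1])

def exact_attention_matrix(Q, K):
    """Row-wise softmax of the scores, i.e. the matrix D^{-1} A."""
    S = _scores(Q, K)
    S = S - S.max(axis=1, keepdims=True)   # subtract row max for stability
    A = np.exp(S)
    return A / A.sum(axis=1, keepdims=True)

def topk_attention_matrix(Q, K, k):
    """Softmax restricted to the k largest scores in each row;
    all other entries of the row are set to exactly zero."""
    S = _scores(Q, K)
    n = S.shape[1]
    # Column indices of the k largest scores per row (order within the k is arbitrary).
    idx = np.argpartition(S, n - k, axis=1)[:, n - k:]
    mask = np.zeros(S.shape, dtype=bool)
    np.put_along_axis(mask, idx, True, axis=1)
    S = np.where(mask, S, -np.inf)         # dropped entries contribute exp(-inf) = 0
    S = S - S.max(axis=1, keepdims=True)
    A = np.exp(S)
    return A / A.sum(axis=1, keepdims=True)
```

A call such as `topk_attention_matrix(Q, K, adaptive_window(Q.shape[0], alpha=2.0, C=0.5))` then grows the window with the input length instead of fixing it in advance. Note that this sketch still forms the full $n \times n$ score matrix, so it illustrates the approximation only and not a sub-quadratic algorithm.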

Paper Structure (41 sections, 22 theorems, 50 equations, 2 figures, 2 tables)

This paper contains 41 sections, 22 theorems, 50 equations, 2 figures, 2 tables.

Introduction
Related Work
Sparse and Efficient Transformer.
Theoretical Approaches to Understanding LLMs.
Preliminary
Attention Sparsity
Sparse Attention and Approximation
Theoretical Insight I: Attention is Provably Naturally Sparse
Concentrations
Attention Sparsity with Bound on Error
Theoretical Insight II: Attention Collapse
Theoretical Insight III and IV: On Stable Sparse Attention Approximation
Theoretical Insight III: $\Omega(n^C)$ Entries Suffice for Stable Sparse Attention Approximation
Theoretical Insight IV: Sparse Attention Fails to Approximate Exact Attention from $o(\log n)$ Entries
Empirical Evaluation
...and 26 more sections

Key Result

Lemma 4.2

Let $\delta \in (0, 0.1)$, and let $R \ge 0$ be as defined in Definition def:R:informal. Then, for all $i_1, i_2 \in [n]$, with probability at least $1 - \delta$, we have

Figures (2)

Figure 1: Count of ineffective entries divided by $n$ in attention computation, denoted as ${\cal S}_{\varepsilon}/n$, decreases with increasing $n$, where $\varepsilon$ is chosen from $\{10^{-4}, 5 \times 10^{-5}, 2 \times 10^{-5}, 10^{-5}, \cdots, 5 \times 10^{-7}, 2 \times 10^{-7}, 10^{-7}\}$.
Figure 2: Simulation of sparse attention approximation with different window size settings $k$. The $\ell_2$ norm error is the average approximation error on the attention matrix ($\frac{1}{n} \| D_{\rm spar}^{-1} A_{\rm spar} - D^{-1} A\|_2$). Note that lines with the same color have equal compute complexity.
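The error metric in the Figure 2 caption can be sketched in a few lines of NumPy. This is a hedged illustration rather than the authors' simulation code: the helper names (`softmax_rows`, `keep_topk`, `approx_error`) are hypothetical, the scores are assumed to be $QK^\top/\sqrt{d}$, the inputs are assumed Gaussian, and $\|\cdot\|_2$ is read here as the spectral norm (a Frobenius reading would use `ord="fro"` instead).

```python
import math
import numpy as np

def softmax_rows(S):
    """Row-wise softmax; entries equal to -inf map to exactly 0."""
    S = S - S.max(axis=1, keepdims=True)
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def keep_topk(S, k):
    """Copy of S where all but the k largest entries per row are -inf."""
    n = S.shape[1]
    idx = np.argpartition(S, n - k, axis=1)[:, n - k:]
    out = np.full_like(S, -np.inf)
    np.put_along_axis(out, idx, np.take_along_axis(S, idx, axis=1), axis=1)
    return out

def approx_error(Q, K, k):
    """(1/n) * || D_spar^{-1} A_spar - D^{-1} A ||_2, with the spectral norm
    assumed for the matrix 2-norm."""
    n, d = Q.shape
    S = Q @ K.T / math.sqrt(d)             # assumption: scaled dot-product scores
    exact = softmax_rows(S)
    sparse = softmax_rows(keep_topk(S, k))
    return np.linalg.norm(sparse - exact, ord=2) / n

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for n in (256, 1024):
        Q = rng.standard_normal((n, 16))
        K = rng.standard_normal((n, 16))
        fixed_k = 32                                  # hypothetical fixed window
        adaptive_k = min(n, math.ceil(2.0 * n ** 0.5))  # hypothetical alpha=2, C=0.5
        print(n, approx_error(Q, K, fixed_k), approx_error(Q, K, adaptive_k))
```

Running the script prints the error for a fixed window next to the error for an adaptive window at each input length, which mirrors the kind of comparison the figure describes; the specific window sizes are placeholders and not the settings used in the paper.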

Theorems & Definitions (54)

Definition 3.1: Attention computation
Definition 3.2: $(\epsilon, k)$-sparsity
Definition 3.3
Definition 3.4: Sparse attention
Definition 3.5: Stable sparse attention approximation ${\sf SSAA}(f)$
Definition 4.1
Lemma 4.2
Proof sketch of Lemma 4.2
Theorem 4.3
Proof sketch of Theorem 4.3
...and 44 more
