Support Basis: Fast Attention Beyond Bounded Entries

Maryam Aliakbarpour, Vladimir Braverman, Junze Yin, Haochen Zhang

TL;DR

This paper tackles the quadratic complexity of softmax attention in transformers by introducing support-basis decomposition to relax the bounded-entry assumption. It develops a single-threshold framework under sub-Gaussianity and a multi-threshold extension that works without distributional assumptions. Both combine exact computation on the large entries with polynomial approximation and sketching for the dense parts to achieve sub-quadratic runtime. The authors also provide a theoretical bridge showing that softmax attention is closely approximated by a combination of polynomial attentions, which explains the empirical success of polynomial attention methods, and they report runtime and accuracy improvements on multiple large-model benchmarks. Overall, the work delivers a practical, scalable approach to efficient attention with strong theoretical guarantees and supportive empirical evidence for real-world LLMs.

Abstract

The quadratic complexity of softmax attention remains a central bottleneck in scaling large language models (LLMs). [Alman and Song, NeurIPS 2023] proposed a sub-quadratic attention approximation algorithm, but it works only under the restrictive bounded-entry assumption. Since this assumption rarely holds in practice, its applicability to modern LLMs is limited. In this paper, we introduce support-basis decomposition, a new framework for efficient attention approximation beyond bounded entries. We empirically demonstrate that the entries of the query and key matrices exhibit sub-Gaussian behavior. Our approach uses this property to split large and small entries, enabling exact computation on sparse components and polynomial approximation on dense components. We establish rigorous theoretical guarantees, proving a sub-quadratic runtime, and extend the method to a multi-threshold setting that eliminates all distributional assumptions. Furthermore, we provide the first theoretical justification for the empirical success of polynomial attention [Kacham, Mirrokni, and Zhong, ICML 2024], showing that softmax attention can be closely approximated by a combination of multiple polynomial attentions with sketching.
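The role of the bounded-entry assumption can be illustrated with a small sketch for the scalar case $d = 1$. Applying a degree-$g$ polynomial entrywise to $q_i k_j$ yields a matrix of rank at most $g + 1$, because each monomial $(q_i k_j)^t$ factors as $q_i^t \cdot k_j^t$. The sketch below is an illustration under stated assumptions, not the paper's algorithm: it uses a truncated Taylor series in place of the Chebyshev polynomial of [Alman and Song, NeurIPS 2023], and the function names are hypothetical.

```python
import math

def taylor_exp_coeffs(g):
    """Coefficients c_t = 1/t! of the degree-g Taylor polynomial of exp."""
    return [1.0 / math.factorial(t) for t in range(g + 1)]

def low_rank_factors(q, k, g):
    """Factors U (n x (g+1)) and W (n x (g+1)) with
    (U W^T)[i][j] = sum_t c_t * (q_i * k_j)^t, i.e. p(q_i * k_j)."""
    c = taylor_exp_coeffs(g)
    U = [[c[t] * qi ** t for t in range(g + 1)] for qi in q]
    W = [[kj ** t for t in range(g + 1)] for kj in k]
    return U, W

def max_abs_error(q, k, g):
    """Largest entrywise gap between exp(q_i * k_j) and the rank-(g+1) product."""
    U, W = low_rank_factors(q, k, g)
    err = 0.0
    for i, qi in enumerate(q):
        for j, kj in enumerate(k):
            approx = sum(u * w for u, w in zip(U[i], W[j]))
            err = max(err, abs(math.exp(qi * kj) - approx))
    return err

# Bounded entries: a low-degree polynomial is accurate.
bounded = [-1.0, -0.5, 0.0, 0.5, 1.0]
print(max_abs_error(bounded, bounded, 6))

# One large entry: the same degree fails badly.
print(max_abs_error([4.0], [4.0], 6))
```

With all entries in $[-1, 1]$ the degree-6 approximation is accurate to roughly three decimal places, while a single product as large as 16 makes the same polynomial useless. This is the gap that splitting off large entries is meant to close.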

Paper Structure

This paper contains 86 sections, 36 theorems, 271 equations, 30 figures, 1 table, 3 algorithms.

Key Result

Theorem 1.4

Let $Q, K, V \in \mathbb{R}^{n \times d}$ be the query, key, and value matrices. Suppose the entries of $Q, K$ are independent and sub-Gaussian with variance proxies $\sigma_Q^2$ and $\sigma_K^2$, respectively. Let $\epsilon, \delta \in (0, 0.1)$ be the accuracy parameter and failure probability, respectively. [...] where $A = \exp(QK^\top / d)$ and $D = \mathrm{diag}(A \cdot \mathbf{1}_n)$.
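As a reference point for the quantities $A$ and $D$ named in the theorem, the following is a minimal sketch of the quadratic-time baseline. It assumes the exact attention output is $D^{-1} A V$, the standard formulation; the precise statement of Definition 1.2 is not reproduced here. The function name `exact_attention` is hypothetical, and plain Python lists are used to keep the example dependency-free.

```python
import math

def exact_attention(Q, K, V):
    """Compute D^{-1} A V with A = exp(Q K^T / d) and D = diag(A 1_n).

    Q, K, V are n x d lists of lists. Takes O(n^2 d) time, which is the
    quadratic cost that approximation methods aim to avoid.
    """
    n, d = len(Q), len(Q[0])
    # A[i][j] = exp(<Q_i, K_j> / d): the exponential is applied entrywise.
    A = [[math.exp(sum(q * k for q, k in zip(Q[i], K[j])) / d)
          for j in range(n)] for i in range(n)]
    # The diagonal of D holds the row sums of A.
    row_sums = [sum(row) for row in A]
    # Row i of the output is a weighted average of the rows of V.
    return [[sum(A[i][j] * V[j][c] for j in range(n)) / row_sums[i]
             for c in range(len(V[0]))] for i in range(n)]

# With Q = 0 every entry of A equals 1, so each output row is the mean row of V.
Q = [[0.0, 0.0], [0.0, 0.0]]
K = [[1.0, -2.0], [0.5, 3.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
print(exact_attention(Q, K, V))
```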

Figures (30)

  • Figure 1: Distribution of entries in the query and key matrices of Layer 10 in TinyLlama-1.1B. The red dashed lines mark the thresholds $\pm\sqrt{\log n}$. Distributions of various other layers and transformer models are deferred to Appendix \ref{sec:distribution}.
  • Figure 2: A visualization of the polynomial method. For simplicity, we set $d = 1$. The matrix $QK^\top$ has rank 1, but the entrywise exponential function $\exp$ may map it to a full-rank matrix. [Alman and Song, NeurIPS 2023] show that replacing $\exp$ with a degree-$g$ Chebyshev polynomial $p$ can produce a matrix of rank $r$.
  • Figure 3: A visualization of $A^{(L)} V$. By the definition of $A^{(L)}$, $\mathrm{supp}(A^{(L)}) = \mathrm{supp}(Q^{(L)}K^\top + Q^{(s)}(K^{(L)})^\top)$. Therefore, visualizing $A^{(L)} V$ is equivalent to visualizing $(Q^{(L)}K^\top + Q^{(s)}(K^{(L)})^\top) V$. By definition, $Q^{(L)}, K^{(L)}$ are sparse matrices with $|\mathrm{supp}(Q^{(L)})| = |\mathrm{supp}(K^{(L)})| = O(n^\alpha)$ for $\alpha \in (0, 1)$, so there are at most $O(n^\alpha)$ non-zero rows (red blocks) in $Q^{(L)}K^\top$ and $O(n^\alpha)$ non-zero columns (blue blocks) in $Q^{(s)}(K^{(L)})^\top$.
  • Figure 4: Runtime and accuracy comparison of attention approximation methods. Our single-threshold support-basis method achieves sub-quadratic efficiency as the threshold increases, eventually outperforming exact attention and matching the performance of [Alman and Song, NeurIPS 2023]. Moreover, our method consistently yields a lower relative Frobenius norm error than [Alman and Song, NeurIPS 2023]. Once the threshold becomes sufficiently large (around 0.4), all entries are approximated, and the error becomes comparable to that of [Alman and Song, NeurIPS 2023].
  • Figure 5: Distribution of entries in the query and key matrices of Layer 0 in TinyLlama-1.1B. The red dashed lines mark the thresholds $\pm\sqrt{\log n}$.
  • ...and 25 more figures
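The support structure described in the Figure 3 caption can be checked on a toy example. The sketch below assumes a single threshold $T$ and the split $Q = Q^{(L)} + Q^{(s)}$, $K = K^{(L)} + K^{(s)}$, where the large parts keep entries with magnitude above $T$. The function names and the value $T = 1$ are hypothetical choices for illustration (Figure 1 marks thresholds at $\pm\sqrt{\log n}$), and the paper's construction of $A^{(L)}$ itself is not reproduced.

```python
def split_by_threshold(M, T):
    """Split M into (large, small): entries with |x| > T versus |x| <= T."""
    large = [[x if abs(x) > T else 0.0 for x in row] for row in M]
    small = [[x if abs(x) <= T else 0.0 for x in row] for row in M]
    return large, small

def matmul_transpose(A, B):
    """Return A B^T for list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row_a, row_b)) for row_b in B]
            for row_a in A]

def add(*matrices):
    """Entrywise sum of equally shaped matrices."""
    return [[sum(vals) for vals in zip(*rows)] for rows in zip(*matrices)]

Q = [[0.2, 3.0], [-0.1, 0.4], [0.3, -0.2]]
K = [[0.5, -0.3], [-2.5, 0.1], [0.2, 0.6]]
T = 1.0  # illustrative threshold, not a value from the paper

Q_L, Q_s = split_by_threshold(Q, T)
K_L, K_s = split_by_threshold(K, T)

rows_term = matmul_transpose(Q_L, K)     # non-zero only in rows where Q_L is non-zero
cols_term = matmul_transpose(Q_s, K_L)   # non-zero only in columns where K_L is non-zero
dense_term = matmul_transpose(Q_s, K_s)  # built from bounded entries only

# Q K^T = Q_L K^T + Q_s K_L^T + Q_s K_s^T, up to floating-point rounding.
full = matmul_transpose(Q, K)
recombined = add(rows_term, cols_term, dense_term)
print(full)
print(recombined)
```

In this example only row 0 of `rows_term` and only column 1 of `cols_term` are non-zero, matching the red row blocks and blue column blocks in the caption, while every factor in `dense_term` has magnitude at most $T$.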

Theorems & Definitions (96)

  • Definition 1.1: $\left(Q, K\right)$-softmax-attention matrix
  • Definition 1.2: The exact attention computation
  • Definition 1.3: The approximate attention computation
  • Theorem 1.4: Informal version of Theorem \ref{thm:subgaussian-main}
  • Theorem 1.5: Informal version of Theorem \ref{thm:attention_approximation_bucketing}
  • Definition A.1
  • Lemma A.2: Lemma 3.2 in [Alman and Song, NeurIPS 2023]
  • Lemma A.3: Corollary 2.2 in [Alman and Song, NeurIPS 2023]
  • Lemma A.4: An improved version of Lemma 3.4 in [Alman and Song, NeurIPS 2023]
  • proof
  • ...and 86 more