Fast Attention Requires Bounded Entries

Josh Alman; Zhao Song

TL;DR

This work studies how fast inner-product attention can be computed when the entries of the input matrices are bounded in magnitude by $B$. It defines exact and approximate attention computation problems and shows a sharp transition at $B=\Theta(\sqrt{\log n})$: with $d=O(\log n)$ and $B=o(\sqrt{\log n})$, attention can be approximated in near-linear time $n^{1+o(1)}$ via a polynomial-method-based low-rank approximation of the attention matrix, while with $B=\Theta(\sqrt{\log n})$ no truly subquadratic algorithm exists assuming SETH. The upper bound combines a tight polynomial approximation of the exponential with low-rank matrix techniques; the lower bound uses a reduction from approximate nearest neighbor search. Together they give a theoretical explanation for the practical observation that attention is cheaper to compute when matrix entries are small. The results connect attention computation to the kernel density estimation (KDE) and approximate nearest neighbor (ANN) literature, and suggest KDE-inspired acceleration methods for transformers as a direction for future work.
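To make the role of the bound $B$ concrete: when every entry of $Q$ and $K$ lies in $[-B,B]$, each entry of $QK^\top/d$ is an average of $d$ products and so lies in $[-B^2,B^2]$. The sketch below measures how the polynomial degree needed to approximate $\exp$ on that interval grows with the interval's size. It uses a plain truncated Taylor series as a stand-in, which is an assumption made for illustration; the paper relies on a tighter polynomial approximation. The function names `taylor_exp` and `min_degree` are hypothetical.

```python
import math

def taylor_exp(x, degree):
    """Evaluate the degree-`degree` Taylor polynomial of exp at x."""
    term, total = 1.0, 1.0
    for j in range(1, degree + 1):
        term *= x / j          # term is now x**j / j!
        total += term
    return total

def max_error(bound, degree, steps=200):
    """Largest absolute error of the Taylor polynomial on a grid over [-bound, bound]."""
    worst = 0.0
    for i in range(steps + 1):
        x = -bound + 2.0 * bound * i / steps
        worst = max(worst, abs(taylor_exp(x, degree) - math.exp(x)))
    return worst

def min_degree(bound, eps):
    """Smallest degree whose grid error on [-bound, bound] is at most eps."""
    degree = 0
    while max_error(bound, degree) > eps:
        degree += 1
    return degree

# Degree needed for additive error 1e-3 as the interval [-B^2, B^2] widens.
degrees = {b_squared: min_degree(b_squared, 1e-3) for b_squared in (0.25, 1.0, 4.0)}
```

The required degree increases with the width of the interval, which is the informal reason small entries help: a lower degree means a lower-rank surrogate for the attention matrix.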

Abstract

In modern machine learning, inner product attention computation is a fundamental task for training large language models such as Transformer, GPT-1, BERT, GPT-2, GPT-3 and ChatGPT. Formally, in this problem, one is given as input three matrices $Q, K, V \in [-B,B]^{n \times d}$, and the goal is to construct the matrix $\mathrm{Att}(Q,K,V) := \mathrm{diag}(A {\bf 1}_n)^{-1} A V \in \mathbb{R}^{n \times d}$, where $A = \exp(QK^\top/d)$ is the "attention matrix", and $\exp$ is applied entry-wise. Straightforward methods for this problem explicitly compute the $n \times n$ attention matrix $A$, and hence require time $\Omega(n^2)$ even when $d = n^{o(1)}$ is small. In this paper, we investigate whether faster algorithms are possible by implicitly making use of the matrix $A$. We present two results, showing that there is a sharp transition at $B = \Theta(\sqrt{\log n})$.

$\bullet$ If $d = O(\log n)$ and $B = o(\sqrt{\log n})$, there is an $n^{1+o(1)}$ time algorithm to approximate $\mathrm{Att}(Q,K,V)$ up to $1/\mathrm{poly}(n)$ additive error.

$\bullet$ If $d = O(\log n)$ and $B = \Theta(\sqrt{\log n})$, assuming the Strong Exponential Time Hypothesis from fine-grained complexity theory, it is impossible to approximate $\mathrm{Att}(Q,K,V)$ up to $1/\mathrm{poly}(n)$ additive error in truly subquadratic time $n^{2 - \Omega(1)}$.

This gives a theoretical explanation for the phenomenon observed in practice that attention computation is much more efficient when the input matrices have smaller entries.
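The definition above, and the idea of using $A$ only implicitly, can be sketched in a few lines of standard-library Python. `exact_attention` is the straightforward quadratic baseline. `low_rank_attention` replaces $A$ by a product $U_1 U_2^\top$ and never forms an $n \times n$ matrix. Several things here are assumptions made for illustration and are not the paper's construction: the surrogate polynomial is a truncated Taylor series, the feature map uses full tensor powers (so its dimension $\sum_{j} d^j$ is larger than a careful monomial count would give), and all function names are hypothetical.

```python
import math
from itertools import product

def exact_attention(Q, K, V):
    """Quadratic baseline: build each row of A = exp(Q K^T / d), then normalize."""
    d = len(Q[0])
    out = []
    for q in Q:
        weights = [math.exp(sum(a * b for a, b in zip(q, k)) / d) for k in K]
        norm = sum(weights)                      # row sum, i.e. an entry of A 1_n
        out.append([sum(w * v[c] for w, v in zip(weights, V)) / norm
                    for c in range(len(V[0]))])
    return out

def features(x, degree):
    """phi(x) such that <phi(q), phi(k)> = sum_{j <= degree} (<q, k> / d)^j / j!."""
    d = len(x)
    phi = []
    for j in range(degree + 1):
        scale = 1.0 / math.sqrt(math.factorial(j) * d ** j)
        for idx in product(range(d), repeat=j):  # all entries of the j-th tensor power
            phi.append(scale * math.prod(x[i] for i in idx))
    return phi

def low_rank_attention(Q, K, V, degree):
    """Approximate attention via A ~ U1 U2^T without forming any n x n matrix."""
    U1 = [features(q, degree) for q in Q]
    U2 = [features(k, degree) for k in K]
    r, dv = len(U1[0]), len(V[0])
    # M = U2^T [V | 1_n]; the extra column yields the row sums of the surrogate A.
    M = [[0.0] * (dv + 1) for _ in range(r)]
    for u, v in zip(U2, V):
        row = list(v) + [1.0]
        for t in range(r):
            for c in range(dv + 1):
                M[t][c] += u[t] * row[c]
    out = []
    for u in U1:
        acc = [sum(u[t] * M[t][c] for t in range(r)) for c in range(dv + 1)]
        out.append([acc[c] / acc[dv] for c in range(dv)])  # divide by the row sum
    return out

# Small example with entries bounded by B = 0.5.
Q = [[0.3, -0.2], [0.1, 0.4], [-0.5, 0.2]]
K = [[0.2, 0.1], [-0.4, 0.3], [0.5, -0.1]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
exact = exact_attention(Q, K, V)
approx = low_rank_attention(Q, K, V, degree=8)
max_gap = max(abs(a - b) for ra, rb in zip(exact, approx) for a, b in zip(ra, rb))
```

For a fixed feature dimension $r$, the second routine touches each row of $Q$, $K$ and $V$ a constant number of times, so its cost grows linearly in $n$. The catch is that $r$ grows with the degree, and the degree needed grows with $B$.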

Paper Structure (20 sections, 14 theorems, 62 equations, 1 algorithm)

Introduction
Our Results
Technique Overview
Preliminaries
Additive Error for Polynomial Approximation
From Additive Error to Relative Error
Attention Algorithm
Matrix Low-Rank Approximation
From Low Degree Polynomials to Low Rank Matrices
Matrix Has Bounded Entries
Key Lemma
From to
From and to Attention Matrix
Main Upper Bound
Proof of
...and 5 more sections

Key Result

Theorem 1.3

Assuming $\mathsf{SETH}$, for every $q>0$, there are constants $C,C_a,C_b>0$ such that: there is no $O(n^{2-q})$ time algorithm for the problem $\mathsf{AAttC}(n,d = C \log n,B= C_b \sqrt{\log n},\epsilon_a = n^{-C_a})$.
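For readers matching the parameters of $\mathsf{AAttC}(n,d,B,\epsilon_a)$ against the abstract, the problem can be read roughly as follows. Definition 1.2 is listed here by name only, so the use of the entrywise maximum norm for the additive error is an assumption, and the output name $T$ is introduced purely for illustration.

$$
\text{Given } Q,K,V \in [-B,B]^{n \times d}, \text{ output } T \in \mathbb{R}^{n \times d} \text{ with } \max_{i,j} \bigl| T_{i,j} - \mathrm{Att}(Q,K,V)_{i,j} \bigr| \le \epsilon_a .
$$

Under this reading, Theorem 1.3 takes $d = C \log n$, $B = C_b \sqrt{\log n}$ and $\epsilon_a = n^{-C_a}$, which corresponds to the abstract's second regime: $d = O(\log n)$, $B = \Theta(\sqrt{\log n})$, and $1/\mathrm{poly}(n)$ additive error.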

Theorems & Definitions (30)

Definition 1.1: Exact Attention Computation $\mathsf{EAttC}(n,d)$
Definition 1.2: Approximate Attention Computation $\mathsf{AAttC}(n,d, B, \epsilon_a)$
Theorem 1.3: Lower bound (informal version of the formal main lower bound theorem)
Theorem 1.4: Upper bound (informal version of the formal main upper bound theorem)
Lemma 2.1 (cited from aa22)
Corollary 2.2
proof
Definition 3.1
Lemma 3.2
proof
...and 20 more
