On the Computational Hardness of Transformers

Barna Saha; Yinzhan Xu; Christopher Ye; Hantao Yu

Abstract

The transformer has revolutionized modern AI across language, vision, and beyond. It consists of $L$ layers, each running $H$ attention heads in parallel and feeding the combined output to the subsequent layer. In attention, the input consists of $N$ tokens, each a vector of dimension $m$. The attention mechanism involves multiplying three $N \times m$ matrices, applying softmax to an intermediate product. Several recent works have advanced our understanding of the complexity of attention. Known algorithms for transformers compute each attention head independently. This raises a fundamental question that has recurred throughout TCS under the guise of "direct sum" problems: can multiple instances of the same problem be solved more efficiently than solving each instance separately? Many answers to this question, both positive and negative, have arisen in fields spanning communication complexity and algorithm design. Thus, we ask whether transformers can be computed more efficiently than $LH$ independent evaluations of attention. In this paper, we resolve this question in the negative, and give the first non-trivial computational lower bounds for multi-head multi-layer transformers. In the small embedding regime ($m = N^{o(1)}$), computing $LH$ attention heads separately takes $LHN^{2 + o(1)}$ time. We establish that this is essentially optimal under SETH. In the large embedding regime ($m = N$), one can compute $LH$ attention heads separately using $LHN^{\omega + o(1)}$ arithmetic operations (plus exponents), where $\omega$ is the matrix multiplication exponent. We establish that this is optimal, by showing that $LHN^{\omega - o(1)}$ arithmetic operations are necessary when $\omega > 2$. Our lower bound in the large embedding regime relies on a novel application of the Baur-Strassen theorem, a powerful algorithmic tool underpinning the famous backpropagation algorithm.
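To make the baseline in the abstract concrete, the sketch below counts scalar multiplications when each of the $LH$ heads is evaluated independently with schoolbook matrix products. The function names and the choice to count only multiplications (ignoring additions, exponentials, and the embedding maps) are assumptions made for illustration; the paper's bounds are asymptotic and, in the large embedding regime, refer to fast matrix multiplication rather than this cubic count.

```python
def naive_attention_mults(N: int, m: int) -> int:
    """Scalar multiplications for one head using schoolbook products (illustrative)."""
    scores = N * N * m  # Q K^T: N x N entries, each an m-term dot product
    output = N * N * m  # softmax(Q K^T) V: N x m entries, each an N-term sum
    return scores + output


def naive_transformer_mults(L: int, H: int, N: int, m: int) -> int:
    """Cost of L layers with H heads each, every head computed independently."""
    return L * H * naive_attention_mults(N, m)


if __name__ == "__main__":
    # Hypothetical sizes: 3 layers, 3 heads, 1024 tokens, embedding dimension 16.
    print(naive_transformer_mults(3, 3, 1024, 16))
```

With $m = N^{o(1)}$ this count is $LHN^{2 + o(1)}$, matching the small embedding baseline; with $m = N$ the schoolbook count is $2LHN^3$, which fast matrix multiplication improves to the $LHN^{\omega + o(1)}$ figure quoted above.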

Paper Structure (26 sections, 24 theorems, 125 equations, 2 figures)

This paper contains 26 sections, 24 theorems, 125 equations, 2 figures.

Introduction
Our Contributions
Technical Overview
Small Embedding Dimension.
Large Embedding Dimension.
Further Related Work
The Expressivity of Transformers.
Algorithms and Hardness for Transformer Computation.
Preliminaries
Arithmetic Circuits
Transformers
Softmax Simulates Hardmax.
Fine-Grained Complexity
Lower Bounds for Small Embedding Dimension
Tools for Algebraic Circuits
...and 11 more sections

Key Result

Theorem 1.1

Under the $3\textup{-}\mathsf{OV}$ Hypothesis or SETH, any algorithm computing $L$-layer, $H$-head transformers with summation aggregation and embedding dimension $m = \Omega\left(\log N\right)$ requires $LHN^{2 - o(1)}$ time. The lower bound holds even if the transformer has no MLPs.

Figures (2)

Figure 1: An attention head with input $X \in \mathbb{R}^{N \times {d_{\textup{in}}}}$, input length $N$ and embedding dimension $m$. $Q(X), K(X), V(X)$ are obtained from $X$ with row-wise embedding maps. Then, the output is computed by $\mathrm{softmax}(Q(X)K(X)^{\top})V(X)$.
Figure 2: A transformer with $H = 3$ and $L = 3$. Attention heads are denoted with light gray and MLPs are denoted by $\Psi$. Residual connections, denoted with dotted lines, additionally add $X^{(\ell - 1)}$ to $X^{(\ell)}$.
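The two captions can be read together as a forward pass. The following is a minimal sketch, not the paper's formal definition: it assumes the row-wise embedding maps $Q$, $K$, $V$ are linear (given by hypothetical weight matrices `Wq`, `Wk`, `Wv`), that heads are combined by summation, that $d_{\textup{in}} = m$ so the residual connection type-checks, and that the MLP $\Psi$ is omitted. All function names are illustrative.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def matmul(A: Matrix, B: Matrix) -> Matrix:
    """Schoolbook matrix product A @ B."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A: Matrix) -> Matrix:
    return [list(col) for col in zip(*A)]


def softmax_rows(S: Matrix) -> Matrix:
    """Apply softmax to each row, subtracting the row maximum for stability."""
    out = []
    for row in S:
        mx = max(row)
        exps = [math.exp(s - mx) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention_head(X: Matrix, Wq: Matrix, Wk: Matrix, Wv: Matrix) -> Matrix:
    """softmax(Q(X) K(X)^T) V(X), with linear embedding maps (an assumption)."""
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    return matmul(softmax_rows(matmul(Q, transpose(K))), V)


def transformer(X: Matrix, layers: Sequence[Sequence[Tuple[Matrix, Matrix, Matrix]]]) -> Matrix:
    """Each layer sums its heads' outputs and adds the residual X^(l-1); no MLP."""
    for heads in layers:
        outs = [attention_head(X, Wq, Wk, Wv) for (Wq, Wk, Wv) in heads]
        X = [
            [X[i][j] + sum(o[i][j] for o in outs) for j in range(len(X[0]))]
            for i in range(len(X))
        ]
    return X
```

Every head in this sketch is computed independently, so a network with $L$ layers and $H$ heads performs $LH$ separate attention evaluations; this is the baseline whose optimality the paper studies.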

Theorems & Definitions (59)

Theorem 1.1: Informal (formal version: thm:small-embedding-formal)
Theorem 1.2: Informal (formal version: thm:linear-embedding-formal)
Definition 2.1
Definition 2.2
Definition 2.3: Attention
Definition 2.4: Hardmax Attention
Definition 2.5
Definition 2.6: Transformer
Definition 2.7
Definition 2.8
...and 49 more
