On the Computational Hardness of Transformers

Barna Saha, Yinzhan Xu, Christopher Ye, Hantao Yu

Abstract

The transformer has revolutionized modern AI across language, vision, and beyond. It consists of $L$ layers, each running $H$ attention heads in parallel and feeding the combined output to the subsequent layer. In attention, the input consists of $N$ tokens, each a vector of dimension $m$. The attention mechanism involves multiplying three $N \times m$ matrices, applying softmax to an intermediate product. Several recent works have advanced our understanding of the complexity of attention. Known algorithms for transformers compute each attention head independently. This raises a fundamental question that has recurred throughout TCS under the guise of "direct sum" problems: can multiple instances of the same problem be solved more efficiently than solving each instance separately? Many answers to this question, both positive and negative, have arisen in fields spanning communication complexity and algorithm design. Thus, we ask whether transformers can be computed more efficiently than $LH$ independent evaluations of attention. In this paper, we resolve this question in the negative, and give the first non-trivial computational lower bounds for multi-head multi-layer transformers. In the small embedding regime ($m = N^{o(1)}$), computing $LH$ attention heads separately takes $LHN^{2 + o(1)}$ time. We establish that this is essentially optimal under SETH. In the large embedding regime ($m = N$), one can compute $LH$ attention heads separately using $LHN^{\omega + o(1)}$ arithmetic operations (plus exponents), where $\omega$ is the matrix multiplication exponent. We establish that this is optimal, by showing that $LHN^{\omega - o(1)}$ arithmetic operations are necessary when $\omega > 2$. Our lower bound in the large embedding regime relies on a novel application of the Baur-Strassen theorem, a powerful algorithmic tool underpinning the famous backpropagation algorithm.
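
The per-head baseline quoted in the abstract can be accounted for roughly as follows. This is a sketch, not the paper's own derivation: it assumes the standard evaluation of one head as two matrix products around a row-wise softmax, it takes the $N \times m$ matrices $Q$, $K$, $V$ as given, and the symbol $T_{\mathrm{head}}$ is introduced here only for illustration.

$$
T_{\mathrm{head}}(N, m) \;=\; \underbrace{O(N^2 m)}_{QK^{\top}} \;+\; \underbrace{O(N^2)}_{\text{row-wise softmax}} \;+\; \underbrace{O(N^2 m)}_{\text{product with } V}
$$

With $m = N^{o(1)}$ this is $N^{2 + o(1)}$ per head, hence $LHN^{2 + o(1)}$ for $LH$ heads evaluated separately. With $m = N$, replacing the two naive products by fast matrix multiplication gives $N^{\omega + o(1)}$ arithmetic operations per head (the $N^2$ exponentials of the softmax are counted separately), hence $LHN^{\omega + o(1)}$ overall.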

Paper Structure (26 sections, 24 theorems, 125 equations, 2 figures)


Key Result

Theorem 1.1

Under the $3\textup{-}\mathsf{OV}$ Hypothesis or SETH, any algorithm computing $L$-layer, $H$-head transformers with summation aggregation and embedding dimension $m = \Omega\left(\log N\right)$ requires $LHN^{2 - o(1)}$ time. The lower bound holds even if the transformer has no MLPs.

Figures (2)

  • Figure 1: An attention head with input $X \in \mathbb{R}^{N \times {d_{\textup{in}}}}$, input length $N$ and embedding dimension $m$. $Q(X), K(X), V(X)$ are obtained from $X$ with row-wise embedding maps. Then, the output is computed by $\mathrm{softmax}(Q(X)K(X)^{\top})V(X)$.
  • Figure 2: A transformer with $H = 3$ and $L = 3$. Attention heads are denoted with light gray and MLPs are denoted by $\Psi$. Residual connections, denoted with dotted lines, additionally add $X^{(\ell - 1)}$ to $X^{(\ell)}$.
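
The two figure captions can be read together as the following minimal sketch, written in plain Python for illustration only. Several things here are assumptions rather than the paper's definitions: the row-wise embedding maps $Q$, $K$, $V$ are taken to be linear (given by hypothetical weight matrices `Wq`, `Wk`, `Wv`), the input width is assumed equal to the embedding dimension so that the residual connection type-checks, heads are combined by summation as in Theorem 1.1, and the MLP $\Psi$ is omitted. All function names are hypothetical.

```python
import math

def matmul(A, B):
    """Naive product of an (n x k) and a (k x p) matrix given as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    """Transpose a matrix given as nested lists."""
    return [list(col) for col in zip(*A)]

def softmax_rows(S):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    out = []
    for row in S:
        peak = max(row)
        exps = [math.exp(s - peak) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def attention_head(X, Wq, Wk, Wv):
    """softmax(Q(X) K(X)^T) V(X), with linear maps standing in for Q, K, V (assumption)."""
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    scores = matmul(Q, transpose(K))  # N x N intermediate product
    return matmul(softmax_rows(scores), V)

def transformer_layer(X, heads):
    """Sum the outputs of all heads and add the residual X (no MLP in this sketch)."""
    out = [row[:] for row in X]  # start from the residual copy of the input
    for Wq, Wk, Wv in heads:
        head_out = attention_head(X, Wq, Wk, Wv)
        out = [[o + h for o, h in zip(orow, hrow)] for orow, hrow in zip(out, head_out)]
    return out

def transformer(X, layers):
    """Apply the layers in sequence; each layer is a list of (Wq, Wk, Wv) triples."""
    for heads in layers:
        X = transformer_layer(X, heads)
    return X
```

Each call to `attention_head` builds its own $N \times N$ score matrix, so this sketch is exactly the "$LH$ independent evaluations" baseline that the paper's lower bounds are measured against.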

Theorems & Definitions (59)

  • Theorem 1.1: Informal version of the formal small-embedding theorem (label thm:small-embedding-formal)
  • Theorem 1.2: Informal version of the formal linear-embedding theorem (label thm:linear-embedding-formal)
  • Definition 2.1
  • Definition 2.2
  • Definition 2.3: Attention
  • Definition 2.4: Hardmax Attention
  • Definition 2.5
  • Definition 2.6: Transformer
  • Definition 2.7
  • Definition 2.8
  • ...and 49 more