
Fundamental Limitations on Subquadratic Alternatives to Transformers

Josh Alman, Hantao Yu

TL;DR

This work ties the quadratic-time bottleneck of Transformer attention to fundamental limits on document similarity tasks. It proves that, assuming SETH or the Orthogonal Vectors Conjecture (OVC), no subquadratic-time approach, whether a heuristic for attention or an alternative architecture, can solve variants of the Most Similar Document (MSD) and Least Similar Document (LSD) problems. In contrast, the authors show that a one-layer Transformer with a single attention head can solve Orthogonal Vectors (OV) and the MSD and LSD variants, which gives a concrete representational separation from subquadratic methods. For tasks that involve document similarity, then, subquadratic attention heuristics and subquadratic architectures cannot match what a Transformer computes, as long as these hardness assumptions hold.

Abstract

The Transformer architecture is widely deployed in many popular and impactful Large Language Models. At its core is the attention mechanism for calculating correlations between pairs of tokens. Performing an attention computation takes quadratic time in the input size, and has become the time bottleneck for Transformer operations. In order to circumvent this, researchers have used a variety of approaches, including designing heuristic algorithms for performing attention computations faster, and proposing alternatives to the attention mechanism which can be computed more quickly. For instance, state space models such as Mamba were designed to replace attention with an almost-linear-time alternative. In this paper, we prove that any such approach cannot perform important tasks that Transformer is able to perform (assuming a popular conjecture from fine-grained complexity theory). We focus on document similarity tasks, where one is given as input many documents and would like to find a pair which is (approximately) the most similar. We prove that Transformer is able to perform this task, and we prove that this task cannot be performed in truly subquadratic time by any algorithm. Thus, any model which can be evaluated in subquadratic time - whether because of subquadratic-time heuristics for attention, faster attention replacements like Mamba, or any other reason - cannot perform this task. In other words, in order to perform tasks that (implicitly or explicitly) involve document similarity, one may as well use Transformer and cannot avoid its quadratic running time.
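
The document similarity task in the abstract can be illustrated with a brute-force baseline. The sketch below is not taken from the paper. It assumes that documents are already encoded as equal-length 0/1 bag-of-words vectors and that similarity is the plain inner product; the names `inner_product` and `most_similar_pair` are hypothetical. It compares all n(n-1)/2 pairs, which is the quadratic cost that, according to the abstract, no algorithm can improve to truly subquadratic time under the stated conjecture.

```python
from itertools import combinations


def inner_product(u, v):
    """Inner product of two equal-length numeric vectors."""
    return sum(a * b for a, b in zip(u, v))


def most_similar_pair(docs):
    """Return (i, j, score) for the pair i < j with the largest inner product.

    Assumption for illustration: `docs` is a list of equal-length 0/1 vectors.
    Every pair is examined, so the running time is quadratic in len(docs).
    """
    if len(docs) < 2:
        raise ValueError("need at least two documents")
    best = None
    for i, j in combinations(range(len(docs)), 2):
        score = inner_product(docs[i], docs[j])
        # Strict comparison keeps the first pair found among ties.
        if best is None or score > best[2]:
            best = (i, j, score)
    return best


docs = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 1],
]
print(most_similar_pair(docs))  # prints (0, 2, 2)
```

Documents 0 and 2 share two words, so they form the most similar pair. Replacing `>` with `<` in the comparison would give the analogous least-similar-pair baseline.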

Paper Structure (31 sections, 24 theorems, 32 equations)


Key Result

Theorem 1.1

Assuming $\mathsf{SETH}$, for every $\varepsilon>0$, there exists a constant $c>0$ such that $\mathsf{LSD}_{n,\ell}$ cannot be solved in $O(n^{2-\varepsilon})$ time when $\ell = c\log n$. Moreover, the same lower bound also holds for $\mathsf{LSD}_{n,\ell,t}$ for some $0< t < 1$, $\gamma\text{-}\mat

Theorems & Definitions (62)

  • Theorem 1.1: Theorem \ref{thm: approximate LSD is hard} and Corollary \ref{cor: variants of LSD are OV hard}
  • Theorem 1.2: Theorem \ref{thm: approximate MSD is hard} and Corollary \ref{cor: variants of MSD are OV hard}
  • Theorem 1.3: Theorem \ref{thm: approximate MSD is hard} and Corollary \ref{cor: variants of MSD are OV hard}
  • Theorem 1.4: Theorems \ref{thm: transformer solves OV} and \ref{thm: transformers can solve SD}
  • Definition 2.1: attention
  • Definition 2.2: Multi-layer perceptron
  • Definition 2.3
  • Definition 2.4: Strong Exponential Time Hypothesis ($\mathsf{SETH}$)
  • Definition 2.5: Orthogonal Vectors ($\mathsf{OV}_{n,\ell}$)
  • Conjecture 2.6: $\mathsf{OVC}$
  • ...and 52 more
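
The list above names the Orthogonal Vectors problem without stating it. As a hedged illustration, the sketch below uses the standard formulation from fine-grained complexity: given two collections of 0/1 vectors, decide whether some vector from the first is orthogonal to some vector from the second. The paper's own Definition 2.5 is not reproduced in this excerpt and may differ in details, and the function name `has_orthogonal_pair` is hypothetical. The brute-force check takes time proportional to the number of pairs times the dimension.

```python
def has_orthogonal_pair(A, B):
    """Return True if some a in A and b in B have inner product zero.

    Assumption for illustration: A and B are lists of equal-length 0/1 vectors.
    All len(A) * len(B) pairs may be examined in the worst case.
    """
    for a in A:
        for b in B:
            # For 0/1 vectors, orthogonal means no coordinate where both are 1.
            if all(x * y == 0 for x, y in zip(a, b)):
                return True
    return False


A = [[1, 0, 1], [0, 1, 1]]
B = [[0, 1, 0], [1, 1, 0]]
print(has_orthogonal_pair(A, B))  # prints True: [1, 0, 1] and [0, 1, 0]

print(has_orthogonal_pair([[1, 1]], [[1, 0], [0, 1]]))  # prints False
```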