The Parallelism Tradeoff: Limitations of Log-Precision Transformers

William Merrill, Ashish Sabharwal

TL;DR

This work analyzes the implicit computation performed by transformers under practical precision limits. By formalizing a log-precision, space-bounded transformer model and showing containment in the uniform circuit class logspace-uniform $\mathsf{TC}^0$, it reveals a parallelism-driven limitation on expressive power despite massive parallel computation. The results imply that, beyond a certain scale, highly parallel architectures may be inherently unable to solve certain poly-time problems, unless the complexity class hierarchy collapses (e.g., $\mathsf{L} = \mathsf{P}$). The paper also develops the notions of instruction-following and advice transformers, demonstrating how such models can simulate $\mathsf{TC}^0$ computations given appropriate instructions or advice. Together, these findings illuminate a fundamental tradeoff between parallelism and computational capability in large-scale transformer-style models, with implications for theory, circuit extraction, and empirical testing of complexity separations.

Abstract

Despite their omnipresence in modern NLP, characterizing the computational power of transformer neural nets remains an interesting open question. We prove that transformers whose arithmetic precision is logarithmic in the number of input tokens (and whose feedforward nets are computable using space linear in their input) can be simulated by constant-depth logspace-uniform threshold circuits. This provides insight on the power of transformers using known results in complexity theory. For example, if $\mathsf L \neq \mathsf P$ (i.e., not all poly-time problems can be solved using logarithmic space), then transformers cannot even accurately solve linear equalities or check membership in an arbitrary context-free grammar with empty productions. Our result intuitively emerges from the transformer architecture's high parallelizability. We thus speculatively introduce the idea of a fundamental parallelism tradeoff: any model architecture as parallelizable as the transformer will obey limitations similar to it. Since parallelism is key to training models at massive scale, this suggests a potential inherent weakness of the scaling paradigm.
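
The step from the simulation result to the conditional hardness claims can be spelled out as a chain of inclusions. This is a sketch of the standard argument, using the well-known fact that logspace-uniform $\mathsf{TC}^0$ is contained in $\mathsf{L}$; it is not quoted from the paper:

$$
\text{log-precision transformers} \;\subseteq\; \text{logspace-uniform } \mathsf{TC}^0 \;\subseteq\; \mathsf{L} \;\subseteq\; \mathsf{P}.
$$

Suppose a problem $A$ is $\mathsf{P}$-complete under logspace reductions and some log-precision transformer solves it. Then $A \in \mathsf{L}$, and since every problem in $\mathsf{P}$ reduces to $A$ in logarithmic space, $\mathsf{P} \subseteq \mathsf{L}$, giving $\mathsf{L} = \mathsf{P}$. Contrapositively, if $\mathsf{L} \neq \mathsf{P}$, no such transformer solves any $\mathsf{P}$-complete problem, which is the kind of problem the abstract's two examples stand for.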

Paper Structure (34 sections, 12 theorems, 11 equations)

This paper contains 34 sections, 12 theorems, 11 equations.

Key Result

Lemma 1

Let $f : \{0, 1\}^* \to \{0, 1\}^m$ be a function. For all $c \in \mathbb{R}^+$ and $n \in \mathbb{N}$, there exists an AND/OR circuit of size at most $n^c + c \log n + m$ and depth $3$ that computes $f$ on inputs of size $c \log n$.
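
One standard way to see where a bound of this shape comes from is a disjunctive-normal-form construction: one NOT gate per input bit, one AND gate per possible input, and one OR gate per output bit. With $k = c \log n$ input bits there are $2^k = n^c$ possible inputs, which gives $c \log n + n^c + m$ gates in three layers. The sketch below illustrates that construction and is not necessarily the authors' exact proof; the names `build_dnf_circuit`, `circuit_size`, `evaluate`, and `demo_f` are hypothetical and chosen only for this example.

```python
from itertools import product


def build_dnf_circuit(f, k, m):
    """Describe a depth-3 NOT/AND/OR circuit for f: {0,1}^k -> {0,1}^m.

    Layer 1: k NOT gates (one per input bit).
    Layer 2: 2^k AND gates, one per possible input ("minterm").
    Layer 3: m OR gates, each wired to the minterms where that output bit is 1.
    """
    minterms = list(product((0, 1), repeat=k))
    or_wiring = [
        [i for i, x in enumerate(minterms) if f(x)[j] == 1]
        for j in range(m)
    ]
    return {"k": k, "m": m, "minterms": minterms, "or_wiring": or_wiring}


def circuit_size(circuit):
    """Count gates: NOT gates + AND gates + OR gates."""
    return circuit["k"] + len(circuit["minterms"]) + circuit["m"]


def evaluate(circuit, x):
    """Evaluate the circuit layer by layer on an input tuple x of k bits."""
    negated = [1 - b for b in x]  # layer 1: NOT gates
    and_out = [  # layer 2: the AND gate for `term` fires exactly when x == term
        int(all(x[i] if bit else negated[i] for i, bit in enumerate(term)))
        for term in circuit["minterms"]
    ]
    return tuple(  # layer 3: OR over the selected minterms
        int(any(and_out[i] for i in wires)) for wires in circuit["or_wiring"]
    )


def demo_f(x):
    """Illustrative function with m = 2 outputs: (parity, majority)."""
    ones = sum(x)
    return (ones % 2, int(2 * ones > len(x)))


# Assumed toy parameters: n = 8 and c = 1, so k = c * log2(n) = 3 input bits.
n, c, m = 8, 1, 2
k = 3
circuit = build_dnf_circuit(demo_f, k, m)
gate_count = circuit_size(circuit)  # equals n**c + k + m for this construction
```

For these toy parameters the gate count is $n^c + c \log n + m = 8 + 3 + 2 = 13$, and evaluating the circuit on each of the 8 inputs reproduces `demo_f`. The construction depends on $f$ only through its truth table, which is why the lemma holds for an arbitrary function.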

Theorems & Definitions (34)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Definition 5
  • Definition 6
  • Definition 7
  • Definition 8
  • Definition 9: $p$-precision transformer layer
  • Definition 10: Transformer layer computation
  • ...and 24 more