Saturated Transformers are Constant-Depth Threshold Circuits

William Merrill, Ashish Sabharwal, Noah A. Smith

TL;DR

The paper analyzes saturated transformers through the lens of circuit complexity, showing that saturated attention surpasses the $\mathsf{AC}^0$ limit that applies to hard attention. It proves that saturated transformers with floating-point activations can be simulated by constant-depth threshold circuits, placing them in $\mathsf{TC}^0$, while rational-valued variants can achieve universal language recognition under size-preserving assumptions. Empirically, saturating attention enables recognition of the majority language, which lies outside $\mathsf{AC}^0$, and the paper proves that per-token representations stay within $O(\log n)$ bits, which is what makes a $\mathsf{TC}^0$ implementation possible. Together, these results position saturated attention as a bridge between practical transformer capabilities and formal circuit-based power, leaving uniformity and comparisons to soft attention as future work.

Abstract

Transformers have become a standard neural network architecture for many NLP problems, motivating theoretical analysis of their power in terms of formal languages. Recent work has shown that transformers with hard attention are quite limited in power (Hahn, 2020), as they can be simulated by constant-depth AND/OR circuits (Hao et al., 2021). However, hard attention is a strong assumption, which may complicate the relevance of these results in practice. In this work, we analyze the circuit complexity of transformers with saturated attention: a generalization of hard attention that more closely captures the attention patterns learnable in practical transformers. We first show that saturated transformers transcend the known limitations of hard-attention transformers. We then prove saturated transformers with floating-point values can be simulated by constant-depth threshold circuits, giving the class $\mathsf{TC}^0$ as an upper bound on the formal languages they recognize.
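To make the contrast between hard and saturated attention concrete, here is a minimal Python sketch. It assumes the common reading in which hard attention selects a single highest-scoring position (leftmost on ties) and saturated attention averages uniformly over every position tied for the highest score. The function names and the use of exact rationals are illustrative choices, not the paper's code.

```python
from fractions import Fraction


def hard_attention(scores, values):
    """Return the value at one maximal-score position.

    Assumption: ties are broken toward the leftmost position
    (Python's max returns the first maximal element).
    """
    best = max(range(len(scores)), key=lambda i: scores[i])
    return values[best]


def saturated_attention(scores, values):
    """Average the values uniformly over all positions tied for the max score."""
    top = max(scores)
    tied = [v for s, v in zip(scores, values) if s == top]
    return sum(tied, Fraction(0)) / len(tied)


if __name__ == "__main__":
    bits = [1, 0, 1, 1, 0]
    uniform_scores = [0] * len(bits)  # every position ties for the max
    print(hard_attention(uniform_scores, bits))       # sees only one position
    print(saturated_attention(uniform_scores, bits))  # fraction of 1s in the input
```

When all scores tie, the saturated head computes the fraction of positions holding a 1, a counting-style quantity that a single hard-attention head cannot read off, since it only ever returns one position's value.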

Paper Structure

This paper contains 39 sections, 10 theorems, 27 equations, 3 figures, 1 table.

Key Result

Theorem 1

$\mathsf{AHAT}(\mathbb Q) = \mathsf{ALL}$.

Figures (3)

  • Figure 1: A circuit that takes a string $\in \{0,1\}^5$ and returns whether it contains the bigram $11$.
  • Figure 2: A program recognizing $\textsc{maj}$ in RASP, a programming language designed to abstract away details of transformer computation (Weiss et al., 2021). frac0 and frac1 measure the fraction of inputs that are $0$ or $1$, respectively. Then maj checks whether frac1 > frac0.
  • Figure 3: In practice, transformers can learn the majority language (which lies outside $\mathsf{AC}^0$). We train $1$-layer transformers on majority, where each line represents a different positional encoding scheme. Training string length was binomial with $n=100$. Trained models were then evaluated on generalization sets with $n$ ranging from $100$ to $500$. Mean length ($x$ axis) is $n/2$.
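The computations described in the captions of Figures 1 and 2 can be sketched in plain Python. This is not the paper's circuit or its RASP program: `contains_bigram_11` assumes one natural depth-2 layout (an OR over ANDs of adjacent bits), and `maj` mirrors the caption's frac0/frac1 comparison using exact fractions. Both function names are hypothetical, and `maj` assumes a nonempty input.

```python
from fractions import Fraction


def contains_bigram_11(bits):
    """Assumed depth-2 AND/OR layout: OR over (x_i AND x_{i+1}) for adjacent pairs."""
    and_gates = [bool(bits[i]) and bool(bits[i + 1]) for i in range(len(bits) - 1)]
    return any(and_gates)


def maj(bits):
    """Compare the fraction of 1s with the fraction of 0s (input must be nonempty)."""
    n = len(bits)
    frac0 = Fraction(sum(1 for b in bits if b == 0), n)
    frac1 = Fraction(sum(1 for b in bits if b == 1), n)
    return frac1 > frac0


if __name__ == "__main__":
    print(contains_bigram_11([0, 1, 1, 0, 0]))
    print(maj([1, 1, 0]))
```

The bigram check needs only a fixed number of gate layers regardless of input length, whereas the majority check depends on counting, which is the kind of operation threshold gates provide.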

Theorems & Definitions (28)

  • Definition 1: Transformer
  • Definition 2: Hard attention
  • Definition 3: Strong saturated attention (Merrill et al., 2020)
  • Definition 4: Weak saturated attention
  • Definition 5: Language recognition
  • Definition 6
  • Definition 7
  • Definition 8
  • Theorem 1
  • proof
  • ...and 18 more