Pause Tokens Strictly Increase the Expressivity of Constant-Depth Transformers

Charles London, Varun Kanade

TL;DR

The paper provides a rigorous complexity-theoretic account of pause tokens in Transformers, showing that polynomially many pause tokens let constant-precision models express all of $\mathsf{AC}^0$ and logarithmic-precision models match $\mathsf{TC}^0$. It distinguishes pause tokens from chain-of-thought prompting by demonstrating that they enlarge parallel computation rather than sequential depth, and it offers empirical evidence that pause tokens aid learning of parity under causal masking. The results clarify how pause tokens interact with precision, width, and depth, and position them as a distinct mechanism for enhancing Transformer reasoning, with practical implications for quantization and model faithfulness. Overall, the work connects empirical observations to formal expressivity gains, suggesting that pause tokens expand the computational workspace available to fixed-depth Transformers in a provably structured way.

Abstract

Pause tokens, simple filler symbols such as "...", consistently improve Transformer performance on both language and mathematical tasks, yet their theoretical effect remains unexplained. We provide the first formal separation result, proving that adding pause tokens to constant-depth, logarithmic-width Transformers strictly increases their computational expressivity. With bounded-precision activations, Transformers without pause tokens compute only a strict subset of $\mathsf{AC}^0$ functions, while adding a polynomial number of pause tokens allows them to express the entire class. For logarithmic-precision Transformers, we show that adding pause tokens achieves expressivity equivalent to $\mathsf{TC}^0$, matching known upper bounds. Empirically, we demonstrate that two-layer causally masked Transformers can learn parity when supplied with pause tokens, a function that they appear unable to learn without them. Our results provide a rigorous theoretical explanation for prior empirical findings, clarify how pause tokens interact with width, depth, and numeric precision, and position them as a distinct mechanism, complementary to chain-of-thought prompting, for enhancing Transformer reasoning.

Paper Structure

This paper contains 31 sections, 22 theorems, 24 equations, 2 figures, 2 tables.

Key Result

Theorem 4.1

$\mathsf{TF}[1, L, P] = \mathsf{AC}^0$.

Figures (2)

  • Figure 1: Two layers of a Transformer with pause tokens can simulate a layer of a Boolean circuit. In the first layer, inputs to the gates in the layer are copied to argument positions. In the second layer, these arguments are combined at the gate position to compute the output of the gate. $v_i$ represents the value of vertex $i$ in the circuit, whether that be an input or a gate. $\text{Arg}(i, j)$ tokens denote an edge from gate $j$ to gate $i$, and $\text{Type}(i)$ tokens denote a gate $i$. The red arrow represents the direction of computation.
  • Figure 2: Test accuracy on predicting the parity of a sequence for Transformers with learned positional encodings, with and without pause tokens and causal masking. Averaged over 3 random seeds.
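
The copy-then-combine data flow described in the Figure 1 caption can be illustrated with a small plain-Python sketch. This is not a Transformer and not the paper's construction; it only mimics the two steps per circuit layer (copy source values into $\text{Arg}(i, j)$ slots, then combine them at the $\text{Type}(i)$ slot). The gate encoding, the function names `simulate_layer` and `xor_via_circuit`, and the choice of XOR on two bits as the example circuit are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical encoding: a gate is (type, list of source vertex ids).
Gate = Tuple[str, List[int]]


def simulate_layer(values: Dict[int, int], gates: Dict[int, Gate]) -> Dict[int, int]:
    """Evaluate one circuit layer in two steps, mirroring the Figure 1 caption."""
    # Step 1 (first Transformer layer in the caption): every Arg(i, j) slot
    # copies v_j, the value of the source vertex j feeding gate i.
    arg_slots = {
        (i, j): values[j]
        for i, (_, sources) in gates.items()
        for j in sources
    }

    # Step 2 (second Transformer layer in the caption): every Type(i) slot
    # combines its argument slots according to the gate type.
    out = dict(values)
    for i, (kind, sources) in gates.items():
        args = [arg_slots[(i, j)] for j in sources]
        if kind == "AND":
            out[i] = int(all(args))
        elif kind == "OR":
            out[i] = int(any(args))
        elif kind == "NOT":
            out[i] = 1 - args[0]
        else:
            raise ValueError(f"unknown gate type: {kind}")
    return out


def xor_via_circuit(a: int, b: int) -> int:
    """XOR(a, b) = (a OR b) AND NOT(a AND b), evaluated layer by layer."""
    values = {0: a, 1: b}  # vertices 0 and 1 are the circuit inputs
    layers = [
        {2: ("OR", [0, 1]), 3: ("AND", [0, 1])},
        {4: ("NOT", [3])},
        {5: ("AND", [2, 4])},
    ]
    for layer in layers:
        values = simulate_layer(values, layer)
    return values[5]


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, xor_via_circuit(a, b))
```

In this sketch the dictionary of argument slots plays the role that pause-token positions play in the caption: extra positions that hold intermediate values so that each gate can be evaluated in parallel within a layer.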

Theorems & Definitions (41)

  • Definition 3.1
  • Definition 3.2
  • Definition 3.3
  • Definition 3.4: Fixed precision arithmetic
  • Definition 3.5: Transformer
  • Definition 3.6: Logspace uniform Transformer families
  • Theorem 4.1
  • Corollary 4.2
  • Theorem 4.3
  • Lemma 4.4
  • ...and 31 more