Pause Tokens Strictly Increase the Expressivity of Constant-Depth Transformers

Charles London; Varun Kanade

TL;DR

The paper provides a rigorous complexity-theoretic account of pause tokens in Transformers, showing that polynomially many pause tokens let constant-precision models express all of $\mathsf{AC}^0$ and make logarithmic-precision models equivalent in expressivity to $\mathsf{TC}^0$. It distinguishes pause tokens from chain-of-thought prompting by demonstrating that they enlarge parallel computation rather than sequential depth, and it offers empirical evidence that pause tokens aid learning of parity under causal masking. The results clarify how pause tokens interact with precision, width, and depth, and position them as a distinct mechanism for enhancing Transformer reasoning, with practical implications for quantization and model faithfulness. Overall, the work connects empirical observations to formal expressivity gains, suggesting that pause tokens expand the computational workspace available to fixed-depth Transformers in a provably structured way.

Abstract

Pause tokens, simple filler symbols such as "...", consistently improve Transformer performance on both language and mathematical tasks, yet their theoretical effect remains unexplained. We provide the first formal separation result, proving that adding pause tokens to constant-depth, logarithmic-width Transformers strictly increases their computational expressivity. With bounded-precision activations, Transformers without pause tokens compute only a strict subset of $\mathsf{AC}^0$ functions, while adding a polynomial number of pause tokens allows them to express the entire class. For logarithmic-precision Transformers, we show that adding pause tokens achieves expressivity equivalent to $\mathsf{TC}^0$, matching known upper bounds. Empirically, we demonstrate that two-layer causally masked Transformers can learn parity when supplied with pause tokens, a function that they appear unable to learn without them. Our results provide a rigorous theoretical explanation for prior empirical findings, clarify how pause tokens interact with width, depth, and numeric precision, and position them as a distinct mechanism, complementary to chain-of-thought prompting, for enhancing Transformer reasoning.
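To make the parity setting concrete, the following is a minimal sketch of how an input with appended pause tokens might be constructed. It is an illustration only, not the authors' experimental code: the token id `PAUSE`, the helper names, the choice to place the pause tokens after the input bits, and the quadratic number of pause tokens used in the example are all assumptions.

```python
import random

# Hypothetical vocabulary: 0 and 1 are the input bits, 2 is the pause (filler) token.
PAUSE = 2


def parity(bits):
    """Return 1 if the number of ones in `bits` is odd, else 0."""
    return sum(bits) % 2


def with_pause_tokens(bits, num_pause):
    """Append `num_pause` filler tokens after the input bits (assumed placement)."""
    return list(bits) + [PAUSE] * num_pause


def make_example(n, num_pause, rng):
    """Sample a uniformly random n-bit string; return (token sequence, parity label)."""
    bits = [rng.randint(0, 1) for _ in range(n)]
    return with_pause_tokens(bits, num_pause), parity(bits)


rng = random.Random(0)
# A polynomial number of pause tokens; n**2 is an arbitrary illustrative choice.
tokens, label = make_example(n=8, num_pause=8 ** 2, rng=rng)
```

The pause tokens carry no information about the input. Under this reading, their only role is to give a fixed-depth model additional positions at which to compute in parallel, which is the extra workspace the abstract refers to.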
