Exact Expressive Power of Transformers with Padding

William Merrill, Ashish Sabharwal

TL;DR

The paper investigates how padding and looping can systematically enlarge transformer expressive power at inference time without adding parameters. It proves that fixed-depth transformers with polynomial padding recognize exactly $\mathsf{FO}$-uniform $\mathsf{TC}^0$, and that adding $O(\log^d n)$ looping on top of polynomial padding yields exactly $\mathsf{FO}$-uniform $\mathsf{TC}^d$. With polylogarithmic looping, padded transformers therefore recognize exactly $\mathsf{FO}$-uniform $\mathsf{NC}$, the most that can be expected without losing parallelism (unless $\mathsf{NC} = \mathsf{P}$). A uniformity-collapse result further shows that the $\mathsf{FO}$-uniform, $\mathsf{L}$-uniform, and $\mathsf{NL}$-uniform versions of these circuit classes coincide for $d \geq 1$, strengthening the bridge between transformer models and circuit complexity. The work motivates padding and looping as parallelizable alternatives to chain of thought for test-time compute, while clarifying the precise theoretical boundaries of such approaches. Open questions remain about empirical applicability and about finer-grained expressive power at different amounts of padding.

Abstract

Chain of thought is a natural inference-time method for increasing the computational power of transformer-based large language models (LLMs), but comes at the cost of sequential decoding. Are there more efficient alternatives to expand a transformer's expressive power without adding parameters? We consider transformers with padding tokens as a form of parallelizable test-time compute. We show that averaging-hard-attention, masked-pre-norm transformers with polynomial padding recognize precisely the class $\mathsf{FO}$-uniform $\mathsf{TC}^0$ of extremely parallelizable problems. While the $\mathsf{TC}^0$ upper bound was known, proving a matching lower bound had been elusive. Further, our novel analysis reveals the precise expanded power of padded transformers when coupled with another form of inference-time compute, namely dynamically increasing depth via looping. Our core technical contribution is to show how padding helps bring the notions of complete problems and reductions, which have been a cornerstone of classical complexity theory, to the formal study of transformers. Armed with this new tool, we prove that padded transformers with $O(\log^d n)$ looping on inputs of length $n$ recognize exactly the class $\mathsf{FO}$-uniform $\mathsf{TC}^d$ of moderately parallelizable problems. Thus, padding and looping together systematically expand transformers' expressive power: with polylogarithmic looping, polynomially padded transformers recognize precisely the class $\mathsf{FO}$-uniform $\mathsf{NC}$, the best that could be expected without losing parallelism (unless $\mathsf{NC} = \mathsf{P}$). Our results thus motivate further exploration of padding and looping as parallelizable alternatives to chain of thought for test-time compute.
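To make "averaging-hard-attention" and "masked" concrete, here is a minimal Python sketch of a single causally masked averaging-hard-attention head. It assumes the usual definition, in which each query puts uniform weight on the visible positions that attain the maximum score; the function name, argument layout, and use of exact rationals are illustrative choices, not the paper's formalism (which also includes pre-norm and feedforward sublayers not modeled here).

```python
from fractions import Fraction


def averaging_hard_attention(scores, values, causal=True):
    """Illustrative averaging-hard-attention head (hypothetical helper).

    scores[i][j] is the attention score of query position i for key position j.
    values[j] is the value vector (a list of numbers) at position j.
    Each query averages the values at the positions tied for the maximum score.
    """
    n = len(values)
    dim = len(values[0])
    outputs = []
    for i in range(n):
        # Causal masking: position i can only see positions 0..i.
        visible = range(i + 1) if causal else range(n)
        best = max(scores[i][j] for j in visible)
        argmax = [j for j in visible if scores[i][j] == best]
        # Uniform average over all maximizing positions (exact arithmetic).
        out = [
            sum(Fraction(values[j][k]) for j in argmax) / len(argmax)
            for k in range(dim)
        ]
        outputs.append(out)
    return outputs


# With all scores tied, each position sees the fraction of 1s in its prefix.
tied_scores = [[0] * 3 for _ in range(3)]
bits = [[1], [0], [1]]
prefix_fractions = averaging_hard_attention(tied_scores, bits)
```

In this usage example every score is tied, so each position attends uniformly over its prefix and outputs the fraction of 1s seen so far. This kind of counting is the sort of operation that threshold circuits, and hence $\mathsf{TC}^0$, can express.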

Paper Structure

This paper contains 22 sections, 30 theorems, 13 equations, 1 figure.

Key Result

Lemma 1

Let $E$ be an unmasked (with position encoding $1/i$) AHAT encoder with depth $\ell \geq 1$. Then there exists a causally masked AHAT decoder $D$ (without any position encoding) with depth $\ell$ and with $\ell n$ padding tokens on input length $n$ that is equivalent to $E$ in the following sense: t

Figures (1)

  • Figure 1: Summary of core results: exact characterizations of the expressive power of $O(\log^d n)$-depth looped AHATs with padding, for $d \geq 0$. The theorem labeled thm:tc0 shows that $\mathsf{AHAT}^0_* = \mathsf{FO}$-uniform $\mathsf{TC}^0$. The theorem labeled thm:tck extends this to show that, for $d \geq 0$, $\mathsf{AHAT}^d_* = \mathsf{FO}$-uniform $\mathsf{TC}^d$. In the process of obtaining these results, we also found the novel circuit complexity result that, for any $d \geq 1$, $\mathsf{FO}$-uniform $\mathsf{TC}^d = \mathsf{L}$-uniform $\mathsf{TC}^d$ (the theorem labeled thm:uniformity-collapse-ACd-TCd). Thus, for $d \geq 1$, $\mathsf{AHAT}^d_* = \mathsf{L}$-uniform $\mathsf{TC}^d$.
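
The per-$d$ characterizations in Figure 1 connect to $\mathsf{NC}$ through the standard containments between circuit classes. As a reminder (stated here for the nonuniform classes, and not quoted from the paper):

$$
\mathsf{NC}^d \subseteq \mathsf{AC}^d \subseteq \mathsf{TC}^d \subseteq \mathsf{NC}^{d+1} \quad (d \geq 0),
\qquad \text{so} \qquad
\mathsf{NC} = \bigcup_{d \geq 0} \mathsf{NC}^d = \bigcup_{d \geq 0} \mathsf{TC}^d .
$$

Taking the union over $d$ of $\mathsf{AHAT}^d_* = \mathsf{FO}$-uniform $\mathsf{TC}^d$ is then consistent with the abstract's statement that polynomially padded transformers with polylogarithmic looping recognize exactly $\mathsf{FO}$-uniform $\mathsf{NC}$.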

Theorems & Definitions (68)

  • Definition 1: Self-attention sublayer
  • Definition 2: Feedforward sublayer
  • Definition 3
  • Definition 4: $d(n)$-looped transformer
  • Definition 5: $w(n)$-padded transformer
  • Definition 6: Padded and looped transformers
  • Definition 7
  • Definition 8: barrington-1990-uniformity
  • Lemma 1: Unmasked to Causally Masked
  • proof
  • ...and 58 more