Theoretical limitations of multi-layer Transformer

Lijie Chen, Binghui Peng, Hongxun Wu

TL;DR

This work asks how powerful multi-layer decoder-only Transformers are at sequential composition. It introduces a multi-party autoregressive communication model that captures the computation of a decoder, and it defines the $L$-sequential function composition task. The main result is an unconditional lower bound: for any $L \le \widetilde{O}(\log\log n)$, an $L$-layer decoder-only Transformer cannot solve $L$-sequential function composition whenever $Hdp \le n^{2^{-4L}}$, where $H$ is the number of attention heads, $d$ the head dimension, $p$ the precision, and $n$ the prompt length. For constant $L$, this means the model dimension must be polynomial, $n^{\Omega(1)}$. Three consequences follow: an exponential depth-width tradeoff (an $(L+1)$-layer Transformer solves the task with polylogarithmic parameters, while any $L$-layer model requires polynomial resources); an unconditional encoder–decoder separation (a shallow encoder of polylogarithmic size solves a task that an $L$-layer decoder cannot); and a provable advantage of chain-of-thought (CoT), in the form of a task that becomes exponentially easier when CoT is allowed. The paper avoids relying on unproven circuit lower bounds, such as those against $\mathsf{TC}^0$, and instead uses the information bottleneck and the autoregressive structure of decoders to establish the bounds. Together, these results give an unconditional account of the limitations and depth-related capabilities of decoder-only Transformers, and they supply a new proof technique, indistinguishable decomposition, for further study of Transformer power.
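To get a feel for the width threshold $n^{2^{-4L}}$ quoted above, the following is a minimal sketch that evaluates its base-2 logarithm, $\log_2(n) / 2^{4L}$, for a few illustrative values. The function name `log2_width_threshold` and the chosen values of $n$ and $L$ are assumptions made for this example and do not come from the paper; the sketch only does arithmetic on the stated bound.

```python
from fractions import Fraction


def log2_width_threshold(log2_n: int, num_layers: int) -> Fraction:
    """Return log2 of n^(2^(-4L)), that is log2(n) / 2^(4L), as an exact fraction.

    log2_n is log2 of the prompt length n; num_layers is the depth L.
    Both names are hypothetical and used only for this illustration.
    """
    return Fraction(log2_n, 2 ** (4 * num_layers))


# Illustrative values only: print the largest Hdp covered by the bound.
for num_layers in (1, 2, 3):
    for log2_n in (16, 256, 4096):
        t = log2_width_threshold(log2_n, num_layers)
        print(f"L={num_layers}, n=2^{log2_n}: bound applies when Hdp <= 2^{float(t):.4g}")
```

The exponent shrinks by a factor of 16 with every extra layer, so the threshold drops to a constant once $2^{4L}$ reaches $\log_2 n$. This is consistent with the statement being restricted to $L \le \widetilde{O}(\log\log n)$.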

Abstract

Transformers, especially the decoder-only variants, are the backbone of most modern large language models; yet we do not have much understanding of their expressive power except for the simple $1$-layer case. Due to the difficulty of analyzing multi-layer models, all previous work relies on unproven complexity conjectures to show limitations for multi-layer Transformers. In this work, we prove the first $\textit{unconditional}$ lower bound against multi-layer decoder-only transformers. For any constant $L$, we prove that any $L$-layer decoder-only transformer needs a polynomial model dimension ($n^{\Omega(1)}$) to perform sequential composition of $L$ functions over an input of $n$ tokens. As a consequence, our results give: (1) the first depth-width trade-off for multi-layer transformers, exhibiting that the $L$-step composition task is exponentially harder for $L$-layer models compared to $(L+1)$-layer ones; (2) an unconditional separation between encoder and decoder, exhibiting a hard task for decoders that can be solved by an exponentially shallower and smaller encoder; (3) a provable advantage of chain-of-thought, exhibiting a task that becomes exponentially easier with chain-of-thought. On the technical side, we propose the multi-party $\textit{autoregressive communication model}$ that captures the computation of a decoder-only Transformer. We also introduce a new proof technique that finds a certain $\textit{indistinguishable decomposition}$ of all possible inputs iteratively for proving lower bounds in this model. We believe our new communication model and proof technique will be helpful to further understand the computational power of transformers.

Paper Structure

This paper contains 58 sections, 20 theorems, 98 equations.

Key Result

Theorem 1.1

Let $H$ be the number of attention heads, $d$ the head dimension, $p$ the precision, $L$ the number of layers, and $n$ the prompt length. For any $L \leq \widetilde{O}(\log\log(n))$, an $L$-layer decoder-only Transformer cannot solve $L$-sequential function composition whenever $Hdp \leq n^{

Theorems & Definitions (37)

  • Theorem 1.1: Lower bound for multi-layer Transformer
  • Corollary 1.2: Depth-size tradeoff
  • Corollary 1.3: Separation between encoder and decoder
  • Corollary 1.4: Provable benefits of CoT
  • Definition 2.1: $L$-sequential function composition
  • Lemma 3.1: Reduction from Transformers to autoregressive communication
  • proof
  • Claim 3.2
  • proof: Proof of Lemma \ref{claim:reduction-hypothesis}
  • Theorem 4.1
  • ...and 27 more