On Limitations of the Transformer Architecture

Binghui Peng; Srini Narayanan; Christos Papadimitriou

TL;DR

This paper reframes Transformer hallucinations as a fundamental limitation in function composition, proving that a single $H$-headed self-attention layer with embedding dimension $d$ and precision $p$ cannot reliably compute $f(g(x))$ over domains of size $n$ when $n\log n$ exceeds $H(d+1)p$. Using communication complexity, it derives probabilistic error bounds and extends the analysis to iterated composition and Chain-of-Thought (CoT) prompts, showing that CoT requires prompts of size at least $\Omega(\sqrt{N})$ to generalize. It further connects compositional tasks to log-space complexity, arguing that several elementary problems (circuit evaluation, derivability, certain SAT variants) are intractable for multi-layer Transformers unless major complexity class equalities such as $\text{L}=\text{NL}$ hold. Together, the results suggest fundamental, architecture-rooted barriers to robust compositionality in Transformers, with implications for model design and the interpretation of empirical failures. The work also discusses caveats, such as the asymptotic nature of the bounds and the potential for memory-time trade-offs and architectural innovations to circumvent these limits, indicating avenues for future research.

Abstract

What are the root causes of hallucinations in large language models (LLMs)? We use Communication Complexity to prove that the Transformer layer is incapable of composing functions (e.g., identify a grandparent of a person in a genealogy) if the domains of the functions are large enough; we show through examples that this inability is already empirically present when the domains are quite small. We also point out that several mathematical tasks that are at the core of the so-called compositional tasks thought to be hard for LLMs are unlikely to be solvable by Transformers, for large enough instances and assuming that certain well accepted conjectures in the field of Computational Complexity are true.
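The grandparent example in the abstract is a composition of one lookup with itself. A minimal sketch of that task, assuming a made-up toy genealogy (the names and the `compose` helper are hypothetical and not taken from the paper):

```python
# Hypothetical toy genealogy: each key's parent is the mapped value.
parent_of = {"alice": "bob", "bob": "carol", "dave": "erin", "erin": "frank"}


def compose(f, g):
    """Return the map x -> f(g(x)), defined only where both lookups succeed."""
    return {x: f[g[x]] for x in g if g[x] in f}


# "Grandparent" is the parent relation composed with itself.
grandparent_of = compose(parent_of, parent_of)
print(grandparent_of)
```

Here the composition takes two dictionary lookups. The paper's claim concerns whether a single Transformer layer can reproduce such a lookup chain once the domain becomes large.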

Paper Structure (16 sections, 6 theorems, 7 equations, 3 figures)


Introduction
Preliminary definitions
Transformer
Function composition
Information theory
The Impossibility of Composition
Compositionality and Logarithmic Space
Circuit evaluation
Derivability
Logical reasoning
Discussion
Acknowledgment
Examples
Spatial composition
Temporal composition
...and 1 more section

Key Result

Theorem 1

Consider a function composition problem with input domain size $|A| = |B| = |C| = n$, and an $H$-headed Transformer layer ${\cal L}$ with embedding dimension $d$ and computation precision $p$, and assume that $H(d+1)p < n \log n$. Then ${\cal L}$ cannot correctly solve the function composition problem ...
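To get a feel for the hypothesis $H(d+1)p < n \log n$, the sketch below evaluates both sides for given parameters. It assumes the logarithm is base 2 (the excerpt does not state the base), and the function name and the parameter values in the usage lines are hypothetical illustrations, not figures from the paper.

```python
import math


def hypothesis_holds(n, heads, dim, precision):
    """Check the theorem's hypothesis H(d+1)p < n log n.

    Assumption: log is taken base 2; the excerpt does not specify the base.
    """
    capacity = heads * (dim + 1) * precision   # H(d+1)p
    demand = n * math.log2(n)                  # n log n
    return capacity < demand


# Hypothetical layer: 32 heads, dimension 4096, 16 bits of precision,
# so H(d+1)p = 32 * 4097 * 16 = 2,097,664.
print(hypothesis_holds(2**16, 32, 4096, 16))  # n log n = 1,048,576 -> False
print(hypothesis_holds(2**17, 32, 4096, 16))  # n log n = 2,228,224 -> True
```

Under these illustrative values, the hypothesis starts to hold between domain sizes $2^{16}$ and $2^{17}$. The abstract notes that failures already appear empirically on much smaller domains.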

Figures (3)

Figure 1: Spatial composition produces incorrect answers
Figure 2: Hallucinations in temporal composition
Figure 3: Hallucinations in relationship composition

Theorems & Definitions (13)

Theorem 1
Lemma 1
proof
Remark 1
proof
Remark 2
Remark 3
Theorem 2
proof
Lemma 2 (nisan1991rounds, klauck2000quantum, yehudayoff2020pointer)
...and 3 more
