One-layer transformers fail to solve the induction heads task

Clayton Sanford, Daniel Hsu, Matus Telgarsky

TL;DR

This work proves a size lower bound for one-layer transformers on the induction heads task: any one-layer transformer with $h$ self-attention heads, embedding dimension $m$, and $p$ bits of precision that solves the task on input sequences of length $n$ must satisfy $hmp = \Omega(n)$. The proof is a reduction from the one-way INDEX communication problem. The result exposes an exponential size gap between one-layer and two-layer transformers, since a two-layer solution exists with $h=O(1)$, $m=O(1)$, and $p=O(\log n)$ when the input alphabet satisfies $|\Sigma|\le n$. The findings formalize a fundamental limitation of shallow transformer architectures on inductive reasoning tasks, with size measured by the product $hmp$, and connect to prior work on the role of induction heads in composition.
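
To make the word "exponential" concrete, the two bounds can be compared through the size product $hmp$. This is only a worked restatement of the figures quoted above, with constants suppressed; the symbol $s$ for the two-layer size is introduced here for illustration.

$$
\underbrace{hmp = \Omega(n)}_{\text{one layer, lower bound}}
\qquad\text{versus}\qquad
\underbrace{s = hmp = O(1)\cdot O(1)\cdot O(\log n) = O(\log n)}_{\text{two layers, upper bound}}
$$

Since $s = O(\log n)$ implies $n = 2^{\Omega(s)}$, a one-layer transformer needs size $\Omega(n) = 2^{\Omega(s)}$, which is exponential in the size that suffices with two layers. For example, at $n = 2^{20}$ the one-layer bound grows with roughly $10^6$, while the two-layer size grows with $\log_2 n = 20$, up to constant factors.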

Abstract

A simple communication complexity argument proves that no one-layer transformer can solve the induction heads task unless its size is exponentially larger than the size sufficient for a two-layer transformer.

Paper Structure (4 sections, 1 theorem, 15 equations)

Key Result

Theorem 1

If a one-layer transformer with $h$ self-attention heads, embedding dimension $m$, and $p$ bits of precision solves the induction heads task for input sequences of length $n$ over a three-symbol alphabet, then $hmp = \Omega(n)$.
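
The sketch below illustrates why the task is related to INDEX. It rests on two assumptions that this summary does not spell out. First, it uses the common formulation of the induction heads task: at each position, find the most recent earlier occurrence of the current token and output the token that followed it. Second, the encoding of an INDEX instance is one illustrative choice over the three symbols `0`, `1`, and `?`, and is not necessarily the construction used in the paper's proof. All function names are hypothetical.

```python
from itertools import product

NULL = None  # assumed placeholder output when the token has not occurred before


def induction_heads(tokens):
    """Reference solution for the (assumed) induction heads task.

    For each position i, find the largest j < i with tokens[j] == tokens[i]
    and output tokens[j + 1]; output NULL if no such j exists.
    """
    last_seen = {}  # token -> most recent index at which it occurred
    out = []
    for i, tok in enumerate(tokens):
        j = last_seen.get(tok)
        out.append(tokens[j + 1] if j is not None else NULL)
        last_seen[tok] = i
    return out


def encode_index_instance(bits, i):
    """Illustrative encoding of an INDEX instance as a length 2k + 1 sequence.

    Alice holds `bits` (k values in {0, 1}); Bob holds the 0-based index `i`.
    Each pair is (Bob's marker, Alice's bit): the marker is '?' only in
    pair i and 0 elsewhere. The final token '?' acts as the query.
    """
    seq = []
    for j, b in enumerate(bits):
        seq.append("?" if j == i else 0)  # position controlled by Bob
        seq.append(b)                     # position controlled by Alice
    seq.append("?")                       # query token at the last position
    return seq


def reduction_is_correct(k):
    """Exhaustively check that the last output equals bits[i] for all inputs."""
    for bits in product([0, 1], repeat=k):
        for i in range(k):
            seq = encode_index_instance(list(bits), i)
            if induction_heads(seq)[-1] != bits[i]:
                return False
    return True
```

Under this encoding the input has length $n = 2k + 1$, and the correct output at the last position is exactly Alice's bit at Bob's index. Any model that solves the task on such inputs therefore solves INDEX on $k$ bits, a problem known to need $\Omega(k)$ bits of one-way communication; this is the kind of link that a bound such as $hmp = \Omega(n)$ relies on.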

Theorems & Definitions (2)

  • Theorem 1
  • proof