One-layer transformers fail to solve the induction heads task

Clayton Sanford; Daniel Hsu; Matus Telgarsky

TL;DR

This work proves a size lower bound for one-layer transformers on the induction heads task: solving the task requires $hmp = \Omega(n)$, where $h$ is the number of self-attention heads, $m$ the embedding dimension, $p$ the number of bits of precision, and $n$ the input sequence length. The proof is a reduction from the one-way INDEX communication problem. The bound implies an exponential size gap between one-layer and two-layer transformers, since a two-layer solution exists with $h = O(1)$, $m = O(1)$, and $p = O(\log n)$ when the input alphabet satisfies $|\Sigma| \le n$. The result formalizes a fundamental limitation of shallow transformer architectures on inductive reasoning tasks, under a precision-based size measure, and connects to prior work on the role of induction heads in composition.

Abstract

A simple communication complexity argument proves that no one-layer transformer can solve the induction heads task unless its size is exponentially larger than the size sufficient for a two-layer transformer.
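One way to read "exponentially larger" is sketched below. It takes the product $hmp$ as the size measure (an assumption made here for illustration, matching the form of the lower bound) and uses the two-layer parameters $h = O(1)$, $m = O(1)$, $p = O(\log n)$ quoted in the TL;DR; the symbols $s$ and $c$ are introduced only for this calculation.

$$
\underbrace{hmp}_{\text{one layer}} = \Omega(n),
\qquad
\underbrace{hmp}_{\text{two layers}} = O(1) \cdot O(1) \cdot O(\log n) = O(\log n).
$$

If the two-layer size is $s \le c \log_2 n$ for some constant $c > 0$, then $n \ge 2^{s/c}$, so the one-layer requirement becomes

$$
hmp = \Omega(n) = \Omega\!\left(2^{s/c}\right) = 2^{\Omega(s)},
$$

that is, exponential in the size that suffices with two layers.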

Paper Structure (4 sections, 1 theorem, 15 equations)

Introduction
Transformer model
Size of one-layer transformers for the induction heads task
Precision details

Key Result

Theorem 1

If a one-layer transformer with $h$ self-attention heads, embedding dimension $m$, and $p$ bits of precision solves the induction heads task for input sequences of length $n$ over a three-symbol alphabet, then $hmp = \Omega(n)$.
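The theorem refers to the induction heads task without restating it here. The sketch below assumes the common formulation: for each position, find the most recent earlier occurrence of the current token and output the token that immediately followed it, or a null value if there is no earlier occurrence. The function name `induction_heads` and the `null` sentinel are hypothetical choices for illustration, not notation from the paper.

```python
from typing import Hashable, List, Optional, Sequence


def induction_heads(
    tokens: Sequence[Hashable], null: Optional[Hashable] = None
) -> List[Optional[Hashable]]:
    """Reference solution for the (assumed) induction heads task.

    For position i, let j < i be the latest earlier index with
    tokens[j] == tokens[i]. The target is tokens[j + 1], or `null`
    when the token has not appeared before.
    """
    last_seen = {}  # token -> index of its most recent occurrence
    targets = []
    for i, tok in enumerate(tokens):
        j = last_seen.get(tok)
        # j + 1 <= i always holds, so the lookup never reads ahead.
        targets.append(tokens[j + 1] if j is not None else null)
        last_seen[tok] = i
    return targets
```

For example, on the three-symbol input `list("abcab")` this returns `[None, None, None, "b", "c"]`: the second `a` copies the token after the first `a`, and the second `b` copies the token after the first `b`. A sequential pass like this solves the task trivially; the theorem concerns how large a one-layer transformer must be to compute the same map.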

Theorems & Definitions (2)

Theorem 1
proof
