One-layer transformers fail to solve the induction heads task
Clayton Sanford, Daniel Hsu, Matus Telgarsky
TL;DR
This work proves a size lower bound for one-layer transformers on the induction heads task: a one-layer transformer with $h$ attention heads, embedding dimension $m$, and $p$ bits of precision that solves the task on inputs of length $n$ must satisfy $h m p = \Omega(n)$. The proof is a reduction from the one-way INDEX communication problem. The result exhibits an exponential efficiency gap between one-layer and two-layer transformers, since a two-layer solution exists with $h=O(1)$, $m=O(1)$, and $p=O(\log n)$ when the input alphabet satisfies $|\Sigma|\le n$. The findings formalize a fundamental limitation of shallow transformer architectures for inductive reasoning tasks, under a precision-based size measure, and connect to prior work on the role of induction heads in composition.
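To make the size of the gap concrete, the two bounds can be compared through the product $hmp$. This is only a restatement of the quantities quoted above, not an additional result:

$$
\underbrace{h\,m\,p = \Omega(n)}_{\text{one layer (necessary)}}
\qquad\text{versus}\qquad
\underbrace{h\,m\,p = O(1)\cdot O(1)\cdot O(\log n) = O(\log n)}_{\text{two layers (sufficient, } |\Sigma|\le n\text{)}}
$$

Since $n = 2^{\log_2 n}$, the one-layer requirement is exponential in the two-layer budget.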
Abstract
A simple communication complexity argument proves that no one-layer transformer can solve the induction heads task unless its size is exponentially larger than the size sufficient for a two-layer transformer.
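For readers unfamiliar with the task, the following is a minimal reference sketch in Python. It assumes the common formulation of the induction heads task, which the text above does not spell out: for each position $i$, output the token that immediately followed the most recent earlier occurrence of the token at position $i$, or a null symbol if there is no earlier occurrence. The function name `induction_heads` and the use of `None` as the null symbol are illustrative choices, not notation from the paper.

```python
from typing import Hashable, List, Optional, Sequence


def induction_heads(tokens: Sequence[Hashable]) -> List[Optional[Hashable]]:
    """Sequential reference solution for the (assumed) induction heads task.

    For each position i, find the largest j < i with tokens[j] == tokens[i]
    and output tokens[j + 1]; output None if no such j exists.
    """
    last_seen = {}  # token -> index of its most recent occurrence so far
    outputs: List[Optional[Hashable]] = []
    for i, tok in enumerate(tokens):
        j = last_seen.get(tok)
        # j < i always holds here, so j + 1 <= i is a valid index.
        outputs.append(tokens[j + 1] if j is not None else None)
        last_seen[tok] = i
    return outputs


result = induction_heads(list("abcab"))
print(result)  # [None, None, None, 'b', 'c']
```

The sketch is a plain sequential program and says nothing about how a transformer computes the same function; it only pins down the input-output behavior whose one-layer cost is bounded below.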
