Transformers know more than they can tell -- Learning the Collatz sequence
François Charton, Ashvni Narayanan
TL;DR
The paper investigates how transformers learn the long Collatz step, a two-loop arithmetic process, by training them to predict the long Collatz successor $\kappa(n)$ of odd inputs $n$ encoded in various bases. It reveals a universal learning pattern: inputs segregate into binary-residue classes tied to the loop lengths $k$ and $k'$, with near-perfect accuracy on learned classes across bases and principled error modes when loop lengths are misestimated. A theoretical link shows that $k$ and $k'$ are read from the binary representation of the input and matched to sequences $S_l$, which offers a partial explanation of the model's behavior. Ablations demonstrate the limits of base conversion and the sensitivity to input distributions, and support the argument that the main difficulty lies in mastering loop control rather than the arithmetic itself. Overall, the work presents a math-grounded methodology for understanding and explaining how transformers learn complex algorithms, with implications for mathematical discovery and robust interpretability.
Abstract
We investigate transformer prediction of long Collatz steps, a complex arithmetic function that maps odd integers to their distant successors in the Collatz sequence ($u_{n+1}=u_n/2$ if $u_n$ is even, $u_{n+1}=(3u_n+1)/2$ if $u_n$ is odd). Model accuracy varies with the base used to encode input and output. It can be as high as $99.7\%$ for bases $24$ and $32$, and as low as $37\%$ and $25\%$ for bases $11$ and $3$. Yet all models, no matter the base, follow a common learning pattern. As training proceeds, they learn a sequence of classes of inputs that share the same residue modulo $2^p$. Models achieve near-perfect accuracy on these classes, and less than $1\%$ accuracy on all other inputs. This maps to a mathematical property of Collatz sequences: the length of the loops involved in the computation of a long Collatz step can be deduced from the binary representation of its input. The learning pattern reflects the model learning to predict inputs associated with increasing loop lengths. An analysis of failure cases reveals that almost all model errors follow predictable patterns. Hallucination, a common feature of large language models, almost never happens. In over $90\%$ of failures, the model performs the correct calculation but wrongly estimates loop lengths. Our observations give a full account of the algorithms learned by the models. They suggest that the difficulty of learning such a complex arithmetic function lies in figuring out the control structure of the computation, namely the length of the loops. We believe that the approach outlined here, using mathematical problems as tools for understanding, explaining, and perhaps improving language models, can be applied to a broad range of problems and yield fruitful results.
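
The two-loop computation described above can be illustrated with a minimal sketch. It assumes that the long Collatz step applies $u \mapsto (3u+1)/2$ while the value is odd (the first loop, of length $k$) and then halves while the value is even (the second loop, of length $k'$). The function names `long_collatz_step` and `trailing_ones` are illustrative and are not taken from the paper.

```python
def long_collatz_step(n):
    """Return (successor, k, k_prime) for an odd positive integer n.

    Assumption: the long step is the odd loop followed by the even loop.
    """
    if n <= 0 or n % 2 == 0:
        raise ValueError("n must be an odd positive integer")
    u = n
    k = 0
    # First loop: apply (3u + 1) / 2 while u is odd.
    while u % 2 == 1:
        u = (3 * u + 1) // 2
        k += 1
    k_prime = 0
    # Second loop: halve while u is even.
    while u % 2 == 0:
        u //= 2
        k_prime += 1
    return u, k, k_prime


def trailing_ones(n):
    """Count the trailing 1 bits in the binary representation of n."""
    count = 0
    while n & 1:
        n >>= 1
        count += 1
    return count


if __name__ == "__main__":
    for n in (7, 27):
        print(n, bin(n), long_collatz_step(n), trailing_ones(n))
```

For $n = 7$ (binary `111`), the first loop runs $7 \to 11 \to 17 \to 26$ and the second runs $26 \to 13$, so the sketch returns `(13, 3, 1)`. Under the stated assumption, the first loop length $k$ equals the number of trailing ones in the binary representation of $n$, which is one concrete way in which a loop length can be read off the binary input.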
