Why Can't Transformers Learn Multiplication? Reverse-Engineering Reveals Long-Range Dependency Pitfalls
Xiaoyan Bai, Itamar Pres, Yuntian Deng, Chenhao Tan, Stuart Shieber, Fernanda Viégas, Martin Wattenberg, Andrew Lee
TL;DR
The paper asks why Transformers struggle with multi-digit multiplication and shows that a model trained with implicit chain-of-thought (ICoT) learns the long-range dependencies that a model trained with standard fine-tuning (SFT) fails to learn. By reverse-engineering the ICoT model, the authors show that attention organizes into a sparse, binary-tree-like directed acyclic graph that caches and retrieves pairwise partial products, and that digits are represented in a Fourier basis, yielding a pentagonal-prism geometry not seen in the SFT model. They formalize the key intermediate quantity as the running sum $\hat{c}_k = s_k + r_{k-1}$, where $s_k = \sum_{i+j=k} a_i b_j$ is the sum of partial products at position $k$, $r_{k-1}$ is the carry from the previous position, and $c_k = \hat{c}_k \bmod 10$ is the output digit. A simple auxiliary loss that predicts $\hat{c}_k$ provides the inductive bias SFT needs to succeed on 4$\times$4-digit multiplication. The work highlights a pitfall of gradient-descent learning on long-range tasks and suggests that task-specific inductive biases can unlock long-range reasoning in Transformers, informing future approaches to arithmetic and other long-horizon tasks.
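To make the running-sum definition concrete, the following is a minimal sketch that computes $s_k$, $\hat{c}_k$, and $c_k$ for two integers. The function name `running_sums`, the least-significant-digit-first indexing, and the carry rule $r_k = \lfloor \hat{c}_k / 10 \rfloor$ are assumptions made for illustration; the carry rule is the standard long-multiplication one and is not spelled out in the text above.

```python
def running_sums(a: int, b: int):
    """Return (c_hat, c), indexed from the least significant digit.

    c_hat[k] = s_k + r_{k-1}, with s_k = sum_{i+j=k} a_i * b_j
    c[k]     = c_hat[k] mod 10
    Assumes r_k = c_hat[k] // 10 and r_{-1} = 0.
    """
    a_digits = [int(d) for d in str(a)[::-1]]  # a_0 is the ones digit
    b_digits = [int(d) for d in str(b)[::-1]]
    n_out = len(a_digits) + len(b_digits)

    c_hat, c = [], []
    carry = 0  # r_{-1}
    for k in range(n_out):
        # s_k: every pairwise partial product whose indices sum to k
        s_k = sum(
            a_digits[i] * b_digits[k - i]
            for i in range(len(a_digits))
            if 0 <= k - i < len(b_digits)
        )
        c_hat_k = s_k + carry
        c_hat.append(c_hat_k)
        c.append(c_hat_k % 10)
        carry = c_hat_k // 10
    return c_hat, c


c_hat, c = running_sums(1234, 5678)
print(c_hat)  # [32, 55, 66, 66, 40, 20, 7, 0]
print(c)      # [2, 5, 6, 6, 0, 0, 7, 0]  (7006652 read from the right)
```

Each $\hat{c}_k$ depends on every digit pair with $i + j \le k$, which is the long-range dependency the summary refers to: the middle output digits draw on almost all input digits.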
Abstract
Language models are increasingly capable, yet still fail at the seemingly simple task of multi-digit multiplication. In this work, we study why, by reverse-engineering a model that successfully learns multiplication via \emph{implicit chain-of-thought}, and report three findings: (1) Evidence of long-range structure: logit attributions and linear probes indicate that the model encodes the necessary long-range dependencies for multi-digit multiplication. (2) Mechanism: the model encodes long-range dependencies using attention to construct a directed acyclic graph to ``cache'' and ``retrieve'' pairwise partial products. (3) Geometry: the model implements partial products in attention heads by forming Minkowski sums between pairs of digits, and digits are represented using a Fourier basis, both of which are intuitive and efficient representations that the standard fine-tuning model lacks. With these insights, we revisit the learning dynamics of standard fine-tuning and find that the model converges to a local optimum that lacks the required long-range dependencies. We further validate this understanding by introducing an auxiliary loss that predicts the ``running sum'' via a linear regression probe, which provides an inductive bias that enables the model to successfully learn multi-digit multiplication. In summary, by reverse-engineering the mechanisms of an implicit chain-of-thought model, we uncover a pitfall for learning long-range dependencies in Transformers and provide an example of how the correct inductive bias can address this issue.
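As an illustration of how a Fourier representation of digits can give rise to a pentagonal-prism shape, the sketch below embeds the digits 0 to 9 using a period-5 cosine/sine pair plus a parity term. This particular choice of components is an assumption made for illustration; the text above does not state which frequencies the model uses, and the helper name `fourier_features` is hypothetical.

```python
import math


def fourier_features(n: int):
    """Map a digit to 3-D: a period-5 circle coordinate plus parity.

    Assumed feature set for illustration only:
      (cos(2*pi*n/5), sin(2*pi*n/5), (-1)**n)
    """
    angle = 2 * math.pi * n / 5
    return (math.cos(angle), math.sin(angle), (-1) ** n)


points = {n: fourier_features(n) for n in range(10)}

# Digits n and n + 5 share a pentagon vertex (same angle) but sit on
# opposite faces of the prism (opposite parity).
for n in range(5):
    x0, y0, z0 = points[n]
    x1, y1, z1 = points[n + 5]
    print(n, n + 5, round(x0 - x1, 9) == 0, round(y0 - y1, 9) == 0, z0 == -z1)
```

Under these assumed features, `n mod 5` selects one of five pentagon vertices and `n mod 2` selects the top or bottom face, so the ten digits land on the ten distinct vertices of a pentagonal prism.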
