Learning the greatest common divisor: explaining transformer predictions
François Charton
TL;DR
This work shows that small transformers can learn to compute the greatest common divisor (GCD) of two integers, and that their predictions can be explained in number-theoretic terms. With integers represented in base $B$ and suitably chosen training distributions, the models learn a sieve-like algorithm: they predict the largest learned divisor of $k=\gcd(a,b)$, effectively clustering inputs by GCD. Key contributions include the identification of the set $\mathcal{D}$ of learned divisors, the demonstration that predictions are deterministic for each GCD under certain training regimes, and a detailed analysis of how the base, the operand and outcome distributions, and the model architecture shape learning and explainability. Log-uniform operand distributions, optionally combined with log-uniform outcomes, speed up learning and raise the number of correctly predicted GCD $\leq 100$ from at most $38$ (uniform operands) to $73$ and $91$ respectively, while training on uniformly distributed outcomes erodes explainability. Training data design therefore directly affects both performance and interpretability in mathematical tasks.
Abstract
The predictions of small transformers, trained to calculate the greatest common divisor (GCD) of two positive integers, can be fully characterized by looking at model inputs and outputs. As training proceeds, the model learns a list $\mathcal D$ of integers, each a product of divisors of the base used to represent integers and of small primes, and predicts the largest element of $\mathcal D$ that divides both inputs. Training distributions impact performance. Models trained on uniform operands only learn a handful of GCD (up to $38$ GCD $\leq100$). Log-uniform operands boost performance to $73$ GCD $\leq 100$, and a log-uniform distribution of outcomes (i.e. GCD) to $91$. However, training on uniform (balanced) GCD breaks explainability.
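
The prediction rule stated above (output the largest element of $\mathcal D$ that divides both inputs) can be illustrated with a minimal Python sketch. This is not the trained model: the function names and the example set `LEARNED` are hypothetical, chosen here as a few divisors of powers of base $10$, and the sketch assumes that $1$ always belongs to $\mathcal D$.

```python
from math import gcd


def predict_gcd(a, b, learned):
    """Return the largest element of `learned` that divides both a and b.

    This mimics the rule described in the abstract; it is an illustration,
    not the transformer itself. Falls back to 1 if nothing else divides both.
    """
    return max((d for d in learned if a % d == 0 and b % d == 0), default=1)


def correctly_predicted(learned, limit=100):
    """List the GCD values k <= limit for which the rule outputs exactly k."""
    # The pair (k, k) has GCD k, so it is enough to test one pair per k.
    return [k for k in range(1, limit + 1) if predict_gcd(k, k, learned) == k]


# Hypothetical learned set for base 10 (assumption: illustrative values only).
LEARNED = {1, 2, 4, 5, 10, 20, 25, 50, 100}

# gcd(40, 60) = 20 is in LEARNED, so the rule gives the correct answer.
print(predict_gcd(40, 60, LEARNED), gcd(40, 60))
# gcd(30, 45) = 15 is not in LEARNED; the rule outputs 5, the largest
# learned divisor of 15.
print(predict_gcd(30, 45, LEARNED), gcd(30, 45))
# Under this rule, a GCD k is predicted correctly exactly when k is in the set.
print(correctly_predicted(LEARNED))
```

Under this reading, the counts quoted in the abstract (for example $38$ GCD $\leq 100$) correspond to the number of elements of $\mathcal D$ that are at most $100$, since every other GCD is mapped to one of its proper divisors.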
