Learning the greatest common divisor: explaining transformer predictions

François Charton

TL;DR

This work shows that small transformers can learn to compute the greatest common divisor with explanations grounded in number theory. By representing integers in a base-$B$ encoding and training on carefully chosen input distributions, the models internalize a sieve-like algorithm that predicts the largest learned divisor of $k=\gcd(a,b)$, effectively clustering inputs by gcd. Key contributions include the identification of a learnable set $\mathcal{D}$ of divisors, demonstration of deterministic predictions for each gcd under certain training regimes, and a detailed analysis of how base choice, operand/outcome distributions, and model architecture shape learning and explainability. The results reveal that log-uniform operand distributions (and, when beneficial, log-uniform outcomes) enable rapid learning and high accuracy (often $>90\%$ for many gcds), while uniform-outcome training can erode explainability, highlighting how training data design directly impacts both performance and interpretability in mathematical tasks.
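The two data-design choices named above, base-$B$ encoding and the operand distribution, can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the function names, the operand bound `max_value`, and the continuous approximation used for log-uniform sampling are all assumptions introduced here.

```python
import math
import random


def to_base(n, base):
    """Digits of a non-negative integer n in the given base, most significant first."""
    if n == 0:
        return [0]
    digits = []
    while n > 0:
        n, r = divmod(n, base)
        digits.append(r)
    return digits[::-1]


def sample_uniform(rng, max_value):
    """Operand drawn uniformly from 1..max_value."""
    return rng.randint(1, max_value)


def sample_log_uniform(rng, max_value):
    """Operand whose logarithm is (approximately) uniform.

    Assumption: log(x) is drawn uniformly and x is rounded down, so P(k) is
    roughly proportional to 1/k. Small operands are therefore much more
    frequent than under uniform sampling.
    """
    u = rng.uniform(0.0, math.log(max_value + 1))
    return min(int(math.exp(u)), max_value)


rng = random.Random(0)
print(to_base(1234, 10))    # digits in base 10
print(to_base(1234, 1000))  # same integer, fewer tokens in base 1000
print([sample_log_uniform(rng, 10**6) for _ in range(5)])
```

A larger base shortens the input sequence, while the sampler determines how often the model sees small operands during training.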

Abstract

The predictions of small transformers, trained to calculate the greatest common divisor (GCD) of two positive integers, can be fully characterized by looking at model inputs and outputs. As training proceeds, the model learns a list $\mathcal D$ of integers, products of divisors of the base used to represent integers and small primes, and predicts the largest element of $\mathcal D$ that divides both inputs. Training distributions impact performance. Models trained on uniform operands learn only a handful of GCDs (up to $38$ of the GCDs $\leq 100$). Log-uniform operands boost performance to $73$ GCDs $\leq 100$, and a log-uniform distribution of outcomes (i.e. GCDs) to $91$. However, training on uniform (balanced) GCDs breaks explainability.
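The prediction rule stated in the abstract (output the largest element of $\mathcal D$ that divides both inputs) can be written out directly. The sketch below is an illustration under stated assumptions: the list `ILLUSTRATIVE_D` is a made-up example of a learned set for base 10, not a result reported by the paper, and the function names are hypothetical.

```python
from math import gcd

# Hypothetical learned set for illustration only (not taken from the paper).
# It must contain 1 so that a prediction always exists.
ILLUSTRATIVE_D = [1, 2, 4, 5, 10, 20, 25, 50, 100]


def predicted_gcd(a, b, learned):
    """Largest element of `learned` that divides both a and b.

    An integer d divides both a and b exactly when it divides gcd(a, b),
    so it suffices to test divisibility of the true GCD.
    """
    k = gcd(a, b)
    return max(d for d in learned if k % d == 0)


def correct_gcds(learned, limit=100):
    """GCD values k <= limit that the rule predicts exactly."""
    return [k for k in range(1, limit + 1) if predicted_gcd(k, k, learned) == k]


print(predicted_gcd(40, 60, ILLUSTRATIVE_D))  # true GCD 20 is in the set
print(predicted_gcd(12, 18, ILLUSTRATIVE_D))  # true GCD 6 is not in the set
print(len(correct_gcds(ILLUSTRATIVE_D)))      # how many GCDs <= 100 are correct
```

Under this rule a GCD is predicted correctly exactly when it belongs to the learned set, which is why counts such as "GCDs $\leq 100$" measure how far the set has grown.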

Paper Structure (20 sections, 1 equation, 7 figures, 23 tables)

Introduction
Experimental settings
Learning the greatest common divisor - Base experiments
Large composite bases $B$ - grokking small primes
Learning from log-uniform operands
Learning from uniform outcomes
Discussion
Rational arithmetic with transformers
Model scaling for the base experiments
Theoretical values of accuracy
Additional experiments
Experiments with outcome distributions
Learning with smaller batches
Uniform operands and outcomes - base 1000
Uniform outcomes - Larger bases
...and 5 more sections

Figures (7)

Figure 1: Correct GCD vs training time. Natural ($\frac{1}{k^2}$) distribution of GCD.
Figure 2: Correct GCD vs training time. 5% uniform, 95% natural GCD.
Figure 3: Learning curves for B=10. Uniform outcomes and operands. 3 different seeds.
Figure 4: Learning curves for B=1000 - uniform operands and outcomes.
Figure 5: Learning curves for base B=2023. 3 different model initializations.
...and 2 more figures
