Learning the greatest common divisor: explaining transformer predictions

François Charton

TL;DR

This work shows that small transformers can learn to compute the greatest common divisor with explanations grounded in number theory. By representing integers in a base-$B$ encoding and training on carefully chosen input distributions, the models internalize a sieve-like algorithm that predicts the largest learned divisor of $k=\gcd(a,b)$, effectively clustering inputs by gcd. Key contributions include the identification of a learnable set $\mathcal{D}$ of divisors, demonstration of deterministic predictions for each gcd under certain training regimes, and a detailed analysis of how base choice, operand/outcome distributions, and model architecture shape learning and explainability. The results reveal that log-uniform operand distributions (and, when beneficial, log-uniform outcomes) enable rapid learning and high accuracy (often $>90\%$ for many gcds), while uniform-outcome training can erode explainability, highlighting how training data design directly impacts both performance and interpretability in mathematical tasks.
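As a rough illustration of what a log-uniform operand distribution means, the sketch below draws integers whose probability is approximately proportional to $1/k$. The helper name `sample_log_uniform`, the upper bound of $10^6$, and the inverse-CDF approach are assumptions made for this example, not details taken from the paper.

```python
import random


def sample_log_uniform(max_value, rng):
    """Draw an integer in [1, max_value) with P(k) roughly proportional to 1/k."""
    # If U is uniform on [0, 1), then max_value ** U is log-uniform on
    # [1, max_value); truncating to an integer keeps P(k) close to 1/k.
    return int(max_value ** rng.random())


# Hypothetical setting: operands below 10**6, fixed seed for reproducibility.
rng = random.Random(0)
samples = [sample_log_uniform(10**6, rng) for _ in range(10_000)]
```

Under this scheme small operands appear far more often than under uniform sampling: about half of the draws fall below $\sqrt{10^6} = 1000$.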

Abstract

The predictions of small transformers, trained to calculate the greatest common divisor (GCD) of two positive integers, can be fully characterized by looking at model inputs and outputs. As training proceeds, the model learns a list $\mathcal D$ of integers, each a product of divisors of the base used to represent integers and of small primes, and predicts the largest element of $\mathcal D$ that divides both inputs. The training distribution affects performance. Models trained on uniform operands learn only a handful of GCDs (up to $38$ of the GCDs $\leq 100$). Log-uniform operands raise this to $73$ GCDs $\leq 100$, and a log-uniform distribution of outcomes (i.e. GCDs) raises it to $91$. However, training on uniform (balanced) GCDs breaks explainability.
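The prediction rule described in the abstract can be sketched in a few lines. This is a minimal illustration, not the model itself: the function name `predict_gcd` and the set `ILLUSTRATIVE_D` are hypothetical, and the set is made up for the example rather than being a list reported in the paper.

```python
def predict_gcd(a, b, learned):
    """Return the largest element of `learned` that divides both a and b."""
    # `learned` plays the role of the list D; it must contain 1 so that
    # at least one candidate always divides both inputs.
    return max(d for d in learned if a % d == 0 and b % d == 0)


# Hypothetical learned set for a base-10 model: divisors of powers of 10 only.
ILLUSTRATIVE_D = {1, 2, 4, 5, 10, 20, 25, 50, 100}

print(predict_gcd(40, 60, ILLUSTRATIVE_D))   # true GCD 20 is in the set
print(predict_gcd(12, 18, ILLUSTRATIVE_D))   # true GCD 6 is not; falls back to 2
print(predict_gcd(21, 35, ILLUSTRATIVE_D))   # true GCD 7 is not; falls back to 1
```

With this rule the prediction is correct exactly when the true GCD belongs to the learned set. Otherwise the output is the largest learned divisor of the true GCD, so all input pairs sharing a GCD receive the same prediction.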

Paper Structure (20 sections, 1 equation, 7 figures, 23 tables)

Figures (7)

  • Figure 1: Correct GCD vs training time. Natural ($\frac{1}{k^2}$) distribution of GCD.
  • Figure 2: Correct GCD vs training time. 5% uniform, 95% natural GCD.
  • Figure 3: Learning curves for B=10. Uniform outcomes and operands. 3 different seeds.
  • Figure 4: Learning curves for B=1000 - uniform operands and outcomes.
  • Figure 5: Learning curves for base B=2023. 3 different model initializations.
  • ...and 2 more figures