Testing Transformer Learnability on the Arithmetic Sequence of Rooted Trees

Alessandro Breccia, Federica Gerace, Marco Lippi, Gabriele Sicuro, Pierluigi Contucci

TL;DR

The paper investigates whether a GPT‑2–style transformer can learn the deterministic arithmetic text $\mathbb{N}\mathcal{T}$, derived from rooted‑tree prime factorizations encoded as Dyck words. It trains a 12‑layer GPT‑2 model on the first $10^{11}$ integers with two self‑supervised tasks, Next‑Word Prediction and Masked Language Modeling, and compares it against a Markov baseline. Results show partial learning of the internal grammar, with the model outperforming baselines and capturing non‑trivial regularities, though prime boundaries remain challenging due to long‑range structure beyond the context window. The work suggests that arithmetic structure can be learned to an extent by transformers and motivates larger models to explore global reasoning and latent representations of number theory.

Abstract

We study whether a Large Language Model can learn the deterministic sequence of trees generated by the iterated prime factorization of the natural numbers. Each integer is mapped into a rooted planar tree and the resulting sequence $\mathbb{N}\mathcal{T}$ defines an arithmetic text with measurable statistical structure. A transformer network (the GPT-2 architecture) is trained from scratch on the first $10^{11}$ elements to subsequently test its predictive ability under next-word and masked-word prediction tasks. Our results show that the model partially learns the internal grammar of $\mathbb{N}\mathcal{T}$, capturing non-trivial regularities and correlations. This suggests that learnability may extend beyond empirical data to the very structure of arithmetic.
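
To make the integer-to-tree mapping concrete, here is a minimal Python sketch of one possible encoding. It is an illustration under stated assumptions, not the authors' implementation. It assumes that each distinct prime factor of $n$ becomes a child of the root, that the subtree under that child is the tree of the corresponding exponent, that children are ordered by increasing prime, and that the tree is written as a Dyck word with `1` opening an edge and `0` closing it. These conventions are chosen to agree with the figure captions, where the string 10 marks primes and 1100 appears as a word. The function names `prime_factorization` and `dyck_word` are hypothetical.

```python
def prime_factorization(n: int) -> list[tuple[int, int]]:
    """Return [(prime, exponent), ...] in increasing prime order (trial division)."""
    factors = []
    p = 2
    while p * p <= n:
        if n % p == 0:
            a = 0
            while n % p == 0:
                n //= p
                a += 1
            factors.append((p, a))
        p += 1
    if n > 1:
        # Whatever remains is a single prime factor with exponent 1.
        factors.append((n, 1))
    return factors


def dyck_word(n: int) -> str:
    """Dyck word of the tree of n under the assumed encoding ('1' opens, '0' closes)."""
    # One child of the root per distinct prime factor; the subtree under that
    # child is the tree of the exponent, built recursively. The integer 1 has
    # no prime factors, so its tree is a bare root and its word is empty.
    return "".join("1" + dyck_word(a) + "0" for _, a in prime_factorization(n))


if __name__ == "__main__":
    for n in range(2, 13):
        print(n, dyck_word(n))
```

Under these assumptions every prime maps to `10`, the integer 4 = 2² maps to `1100`, and 12 = 2²·3 maps to `110010`. The word records only the nesting of exponents, not which primes occur, so distinct integers such as 4 and 8 can share the same tree.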

Paper Structure

This paper contains 14 sections, 17 equations, 8 figures.

Figures (8)

  • Figure 1: On the left, attention map of a single attention head at a single layer in an NLP setting: color represents the intensity of the attention values produced on the words of the sentence when focusing on the word "it". On the right, multi-head attention map in our setting for the 'word' 1100, highlighted in gray. In the right column, each element of the arithmetic sequence is associated with a vector of color bars, one per attention head: the intensity of the color is proportional to the attention weight produced by the corresponding head.
  • Figure 2: Loss curves for a model trained via NWP, obtained for different tokenizer dictionary sizes $D$, namely $D=64$, $D=256$, $D=1024$. The loss value is rescaled by $\ln D$, the loss value associated with a uniform distribution.
  • Figure 3: Accuracy for words ($A^w$, left) and Kullback-Leibler divergence for word distribution ($\mathrm{KL}^w$, right) are reported at different temperature values $T$, averaged over a set of 32 different inputs for each $T$. The green dotted line represents the Markov Chain (MC) model baseline, while the black dash-dotted line identifies the large temperature limit.
  • Figure 4: Precision, recall and $F_1$ score for a set of $L=1024$ generated tokens. The $x$-axis is ordered by word frequency in the input sentences, namely the true sequence. On the left, results for our LLM; on the right, the corresponding quantities for the Markov baseline.
  • Figure 5: Distributions of the generated Dyck words at real prime positions (left) and of real words at predicted prime positions (right), averaged over a set of 10 generations of 1024 tokens. The LLM output (top row) is compared with a simple Markov model prediction (bottom row) and performs better in both cases. Bars labeled with the string 10 correspond to correct predictions.
  • ...and 3 more figures