From Next-Token to Mathematics: The Learning Dynamics of Mathematical Reasoning in Language Models
Shubhra Mishra, Gabriel Poesia, Noah D. Goodman
TL;DR
This paper introduces MathCAMPS, a curriculum-grounded synthetic benchmark of mathematical reasoning built from 44 Common Core standards for grades K-8. Problems are generated symbolically and then realized as natural-language word problems with cycle-consistency checks. This design enables fine-grained analysis of how mathematical reasoning skills emerge during pre-training and respond to instruction tuning across checkpoints of open-weight LLMs. The study finds that learning trajectories loosely align with the human curriculum, that skills typically improve smoothly during training, and that robustness to follow-up questions improves over time, while the effects of instruction tuning depend on both the model and the skill. The framework supports per-standard and per-grade analyses, informs pre-training and evaluation design, and could be extended to broader mathematical domains. Overall, MathCAMPS provides a scalable, automated way to probe the dynamics of mathematical reasoning in LLMs and to compare human and machine learning trajectories.
Abstract
Large Language Models (LLMs) trained solely on next-token prediction learn to solve a wide range of problems involving mathematical reasoning. But how does this ability evolve during training? We present the first analysis of how the mathematical reasoning abilities of several open-weight LLMs develop during pre-training and post-training. To this end, we construct MathCAMPS, a synthetic dataset of novel mathematical reasoning problems grounded in 44 fine-grained skills taken from the Common Core curriculum for grades K through 8. In one experiment, we show that mathematical skills are learned during pre-training in an order that measurably correlates with the human-designed curriculum, even though the training data are randomly ordered. We also present a detailed analysis of which mathematical abilities benefit from instruction tuning, a widely used post-training method, and which, in contrast, suffer. Our work paves the way for an empirical understanding of LLM training dynamics in relation to reasoning.
