From Next-Token to Mathematics: The Learning Dynamics of Mathematical Reasoning in Language Models

Shubhra Mishra, Gabriel Poesia, Noah D. Goodman

TL;DR

This paper introduces MathCAMPS, a curriculum-grounded, synthetic benchmark of mathematical reasoning built from 44 Common Core standards for K-8. By generating symbolic problems and realizing them as natural-language word problems with cycle-consistency checks, it enables fine-grained analysis of how mathematical reasoning skills emerge during pre-training and respond to instruction tuning in open-weight LLM checkpoints. The study reveals that learning trajectories loosely align with human curricula, skills typically evolve smoothly during training, and robustness to follow-up questions improves over time with nuanced model- and skill-dependent effects from instruction tuning. The framework supports extensive, per-standard and per-grade analyses and offers valuable insights for pre-training, evaluation design, and future extensions to broader mathematical domains. Overall, MathCAMPS provides a scalable, automated way to probe the dynamics of mathematical reasoning in LLMs and to compare human and machine learning trajectories.

Abstract

Large Language Models (LLMs) trained solely on next-token prediction learn to solve a wide range of problems involving mathematical reasoning. But how does this ability evolve during training? We present the first analysis of how the mathematical reasoning abilities of several open-weight LLMs develop during pre-training and post-training. To this end, we construct MathCAMPS, a synthetic dataset of novel mathematical reasoning problems grounded in 44 fine-grained skills taken from the Common Core curriculum from kindergarten through 8th grade. In one experiment, we show that mathematical skills are learned during pre-training in an order that measurably correlates with the human-designed curriculum, even though training data are randomly ordered. We also present a detailed analysis of which mathematical abilities benefit from instruction tuning, a widely used post-training method, and which skills, in contrast, suffer. Our work paves the way for an empirical understanding of LLM training dynamics in relation to reasoning.

Paper Structure (39 sections, 18 figures, 15 tables)

Figures (18)

  • Figure 1: Overview of the MathCAMPS generation pipeline. We start from a grammar (A) that represents problems tied to a Common Core Standard, a specific mathematical ability drawn from a human curriculum. We sample problems in symbolic form (B) and use a language model to realize them in natural language (C), applying a cycle-consistency check in which we back-translate each problem into symbolic form and ensure the answer remains the same, validating truthfulness. We also synthesize incremental and counterfactual follow-up problems.
  • Figure 2: Model accuracy on problems coming from different grade groups evaluated during training. Each data point corresponds to an LLM checkpoint evaluated on MathCAMPS problems testing skills from the indicated range of grades. Training Progress (X-axis) is measured by percentage of total pre-training tokens seen by the checkpoint. Accuracy is final-answer accuracy on solving the problems with few-shot CoT prompting.
  • Figure 3: Performance on problems of varying number of digits in their final answer, across pre-training checkpoints.
  • Figure 4: Learning dynamics of individual Common Core standards in grades 2 and 7. Full results for all grades can be found in Appendix \ref{app:learning-dynamics-all-grades}.
  • Figure 5: Performance for models during training when also asked to answer follow-up questions about each problem. Here, we only consider problems that have at least one associated follow-up question (counterfactual or incremental).
  • ...and 13 more figures
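
The cycle-consistency check described in the Figure 1 caption can be illustrated with a small sketch. This is not the authors' implementation: the paper uses a grammar per Common Core standard and a language model for both realization and back-translation, whereas the toy below assumes a single two-operand arithmetic problem type, a fixed sentence template in place of the language model, and a regular expression in place of back-translation. All names (SymbolicProblem, sample_problem, realize, back_translate, cycle_consistent) are hypothetical and chosen only for illustration.

```python
import random
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SymbolicProblem:
    """Toy symbolic problem: a single addition or subtraction."""
    a: int
    op: str  # "+" or "-"
    b: int


def solve(p: SymbolicProblem) -> int:
    """Compute the ground-truth answer from the symbolic form."""
    return p.a + p.b if p.op == "+" else p.a - p.b


def sample_problem(rng: random.Random) -> SymbolicProblem:
    """Stand-in for sampling from a per-standard grammar (assumption)."""
    a = rng.randint(1, 99)
    op = rng.choice(["+", "-"])
    # Keep subtraction results non-negative in this toy setting.
    b = rng.randint(1, a) if op == "-" else rng.randint(1, 99)
    return SymbolicProblem(a, op, b)


def realize(p: SymbolicProblem) -> str:
    """Stand-in for the language model that writes the word problem."""
    verb = "buys" if p.op == "+" else "gives away"
    return (f"Ana has {p.a} apples. She {verb} {p.b} apples. "
            "How many apples does Ana have now?")


def back_translate(text: str) -> Optional[SymbolicProblem]:
    """Stand-in for back-translation: recover the symbolic form, or None."""
    m = re.search(r"has (\d+) apples\. She (buys|gives away) (\d+) apples", text)
    if m is None:
        return None
    op = "+" if m.group(2) == "buys" else "-"
    return SymbolicProblem(int(m.group(1)), op, int(m.group(3)))


def cycle_consistent(p: SymbolicProblem, text: str) -> bool:
    """Keep a word problem only if its back-translation has the same answer."""
    recovered = back_translate(text)
    return recovered is not None and solve(recovered) == solve(p)


if __name__ == "__main__":
    rng = random.Random(0)
    problem = sample_problem(rng)
    text = realize(problem)
    print(text)
    print("kept" if cycle_consistent(problem, text) else "discarded")
```

In this sketch the check compares final answers rather than requiring the recovered symbolic form to match exactly, which mirrors the caption's criterion that "the answer remains the same"; a word problem whose back-translation fails or yields a different answer would be discarded.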