Exploring Mathematical Extrapolation of Large Language Models with Synthetic Data

Haolong Li, Yu Ma, Yinqi Zhang, Chen Ye, Jie Chen

TL;DR

This work tackles multi-step mathematical reasoning and extrapolation in LLMs by introducing a new arithmetical puzzle and a scalable synthetic data pipeline. Using open-llama-3B with LoRA fine-tuning, the authors show that scaling the synthetic data to 100M samples yields a zero-shot in-domain pass@1 of $0.44$ and out-of-domain pass@1 of $0.33$ (numerical OOD) and $0.35$ (form OOD), indicating notable extrapolation capabilities. The approach relies on a purely symbolic SFT prompt and a verifier to ensure correctness, and shows that data scale improves both training convergence and reasoning depth. The paper notes that full extrapolation across broader mathematical tasks remains an open challenge, motivating future work on more generalizable reasoning. Overall, the study highlights the potential of synthetic data to enhance multi-step reasoning in LLMs and provides a concrete framework for evaluating extrapolation through purpose-built OOD benchmarks.

Abstract

Large Language Models (LLMs) have shown excellent performance in language understanding, text generation, code synthesis, and many other tasks, but they still struggle with complex multi-step reasoning problems, such as mathematical reasoning. In this paper, through a newly proposed arithmetical puzzle problem, we show that the model can perform well on multi-step reasoning tasks via fine-tuning on high-quality synthetic data. Experimental results with the open-llama-3B model on three different test datasets show that the model not only reaches a zero-shot pass@1 of 0.44 on the in-domain dataset but also demonstrates certain generalization capabilities on the out-of-domain datasets. Specifically, this paper designs two out-of-domain datasets by separately extending the numerical range and the composing components of the arithmetical puzzle problem. The fine-tuned models show encouraging performance on these two far more difficult tasks, with zero-shot pass@1 of 0.33 and 0.35, respectively.

Paper Structure (20 sections, 5 figures, 7 tables, 2 algorithms)

Figures (5)

  • Figure 1: Distributions of $N$ and $X$ for different training set sizes (1M / 10M / 100M samples). $N$ denotes the total number of candidate integers of our puzzle, $X = (X_1, X_2, \ldots, X_N)$ denotes the candidate integers.
  • Figure 2: Distributions of the tokenized prompt and response lengths for different training set sizes (1M / 10M / 100M samples).
  • Figure 3: The training loss and zero-shot pass@1 on ID dataset for different training set sizes (1M / 10M / 100M samples).
  • Figure 4: Cases from the form OOD test dataset. Correct steps are highlighted in green and incorrect steps in red. In general, the model fine-tuned on 1M training samples performs worst.
  • Figure 5: Visualization of the proposed arithmetical puzzle. Given the candidate integers $3, 6, 7, 51, 58$ and the target integer $4$, the answer is $58-51=7, 6-7=-1, 3\times(-1)=-3, -3+7=4$.
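
The worked answer in Figure 5 can be checked mechanically. The following is a minimal sketch of such a check, not the authors' verifier: the function name `verify_solution`, the tuple encoding of steps, and the rules it enforces (each candidate integer is used exactly once, each intermediate result returns to the pool of available numbers, and only `+`, `-`, `*` are supported) are assumptions made for illustration.

```python
from collections import Counter
import operator

# Supported operations (an assumption; the example in Figure 5 only needs these three).
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def verify_solution(candidates, target, steps):
    """Check a step-by-step answer to the arithmetical puzzle.

    `steps` is a list of (left, op, right, claimed_result) tuples.
    Hypothetical rules: every candidate is consumed exactly once, and each
    intermediate result becomes available for later steps.
    """
    pool = Counter(candidates)  # multiset of numbers still available
    last_result = None
    for left, op, right, claimed in steps:
        if op not in OPS:
            return False
        needed = Counter([left, right])
        # Both operands must currently be available (with multiplicity).
        if any(pool[value] < count for value, count in needed.items()):
            return False
        # The arithmetic of the step itself must be correct.
        if OPS[op](left, right) != claimed:
            return False
        pool -= needed       # consume the operands
        pool[claimed] += 1   # the result can be used by a later step
        last_result = claimed
    # Only the final result may remain, and it must equal the target.
    return last_result == target and sum(pool.values()) == 1


if __name__ == "__main__":
    figure5_steps = [
        (58, "-", 51, 7),
        (6, "-", 7, -1),
        (3, "*", -1, -3),
        (-3, "+", 7, 4),
    ]
    print(verify_solution([3, 6, 7, 51, 58], 4, figure5_steps))
```

Under these assumed rules, an answer that reaches the target but leaves candidates unused, or that contains an arithmetic slip in any step, is rejected.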