Data Repetition Beats Data Scaling in Long-CoT Supervised Fine-Tuning

Dawid J. Kopiczko, Sagar Vaze, Tijmen Blankevoort, Yuki M. Asano

TL;DR

This work shows that, for long-CoT supervised fine-tuning, repeating the same demonstrations (many epochs on small datasets) under a fixed update budget can outperform training on a larger set for fewer epochs. Training token accuracy serves as a practical indicator of when to stop epoch scaling, as gains saturate near full memorization without additional catastrophic forgetting. The repetition advantage persists across models, benchmarks, teacher qualities, and even when training on negative trajectories, though the underlying causal mechanism remains open. The findings offer actionable guidance for compute-efficient reasoning SFT and frame an open problem around why memorization through repetition yields improved generalization in reasoning tasks.

Abstract

Supervised fine-tuning (SFT) on chain-of-thought data is an essential post-training step for reasoning language models. Standard machine learning intuition suggests that training with more unique training samples yields better generalization. Counterintuitively, we show that SFT benefits from repetition: under a fixed update budget, training for more epochs on smaller datasets outperforms single-epoch training on larger datasets. On AIME'24/25 and GPQA benchmarks, Olmo3-7B trained for 128 epochs on 400 samples outperforms the equivalent 1 epoch on 51200 samples by 12-26 percentage points, with no additional catastrophic forgetting. We find that training token accuracy reliably signals when repetition has saturated; improvements from additional epochs plateau at full memorization, a pattern consistent across all settings. These findings provide a practical approach for reasoning SFT, where scaling epochs with token accuracy as a stopping criterion can replace expensive undirected data scaling. We pose the repetition advantage, where full memorization coincides with improved generalization, as a new open problem for the community in understanding the training dynamics of large language models.
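
The fixed update budget in the abstract can be checked with simple arithmetic: 128 epochs on 400 samples and 1 epoch on 51200 samples both process 51200 samples in total. The sketch below enumerates the configurations on one such budget line. The function name `budget_diagonal`, the doubling schedule for epochs, and the 400-sample floor are illustrative assumptions, not details taken from the paper; it also assumes a fixed batch size, so that equal samples × epochs implies an equal number of gradient updates.

```python
def budget_diagonal(total_samples_seen, min_samples=400):
    """List (unique_samples, epochs) pairs whose product equals the budget.

    Epochs are doubled at each step (an assumed schedule), and enumeration
    stops once the dataset would shrink below `min_samples`.
    """
    configs = []
    epochs = 1
    while (total_samples_seen % epochs == 0
           and total_samples_seen // epochs >= min_samples):
        configs.append((total_samples_seen // epochs, epochs))
        epochs *= 2
    return configs


# Budget quoted in the abstract: 1 epoch x 51200 samples.
configs = budget_diagonal(51200)
for samples, epochs in configs:
    print(f"{samples:>6} samples x {epochs:>3} epochs = {samples * epochs}")
```

Every printed row costs the same number of sample passes; the abstract's comparison is between the first row (51200 samples, 1 epoch) and the last (400 samples, 128 epochs).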

Paper Structure (28 sections, 1 equation, 13 figures, 4 tables)

Figures (13)

  • Figure 1: Illustration of our approach to supervised fine-tuning in a modern LLM training pipeline. Instead of maximizing dataset size and training for few epochs, we train for many epochs on a small random subset of SFT data, substantially reducing compute while improving downstream reasoning performance.
  • Figure 2: Scaling epochs versus scaling data for Olmo3-7B trained on long-CoT SFT data, averaged across AIME'24, AIME'25, and GPQA benchmarks. Each diagonal represents a fixed update budget, where epochs × samples is constant. Within any diagonal, moving toward fewer samples and more epochs consistently improves accuracy and pass@n, with gains diminishing around 32–64 epochs. Termination rate correlates strongly with accuracy and may be a primary driver of performance gains, as models that fail to terminate cannot produce a final answer.
  • Figure 3: The repetition advantage is consistent across models, benchmarks, and evaluation metrics. Heatmaps show normalized scores for Olmo3-7B (top) and Qwen3-8B (bottom) on AIME'24, AIME'25, and GPQA, evaluated with both Accuracy@n and Pass@n. Each diagonal corresponds to a fixed update budget (epochs × samples), and in all settings, performance improves when moving along a diagonal toward fewer samples and more epochs.
  • Figure 4: Relationship between training set memorization and downstream performance for Olmo3-7B. Points are colored by epoch count; within each epoch group, variation reflects different dataset sizes. Token accuracy on the training set increases primarily with epochs rather than with total updates. Across all benchmarks, performance gains plateau once models approach full memorization, suggesting that token accuracy can serve as a stopping criterion for epoch scaling. The initial token accuracy of the base model is marked with a vertical line.
  • Figure 5: Training dynamics for Olmo3-7B showing the relationship between loss, entropy, and downstream performance averaged over AIME'24, AIME'25, and GPQA. Points are colored by epoch count; within each group, variation reflects dataset size. As epochs increase, train loss approaches zero while validation loss rises, the classical signature of overfitting in terms of the train-validation gap. Prediction entropy also decreases, showing increased model confidence in predictions that diverge from the validation distribution. Despite these indicators, downstream accuracy improves with epoch count. Vertical lines mark base model metrics.
  • ...and 8 more figures
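
The stopping criterion described in the Figure 4 caption could be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes token accuracy means the fraction of supervised target tokens that the model's top-1 prediction matches, that masked positions carry the label -100 (a common convention in PyTorch-based training code), and that a threshold of 0.99 stands in for "near full memorization". The excerpt gives no numeric cutoff, so the threshold and the helper names `token_accuracy` and `should_stop` are hypothetical.

```python
IGNORE_INDEX = -100  # assumed label value for positions excluded from the loss


def token_accuracy(predicted_ids, target_ids, ignore_index=IGNORE_INDEX):
    """Fraction of non-masked target tokens matched by the top-1 prediction."""
    correct = 0
    counted = 0
    for pred, tgt in zip(predicted_ids, target_ids):
        if tgt == ignore_index:
            continue  # skip prompt or padding positions
        counted += 1
        correct += int(pred == tgt)
    return correct / counted if counted else 0.0


def should_stop(accuracy_history, threshold=0.99):
    """Stop epoch scaling once the latest train token accuracy reaches the
    threshold (hypothetical value, chosen only for illustration)."""
    return bool(accuracy_history) and accuracy_history[-1] >= threshold


# Toy example: one masked position, two of three supervised tokens correct.
acc = token_accuracy([5, 7, 9, 2], [IGNORE_INDEX, 7, 9, 3])
print(round(acc, 3))
print(should_stop([0.60, 0.85, 0.995]))
```

In practice the accuracy would be averaged over the training set after each epoch, and training would end at the first epoch where `should_stop` returns true.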