
Characterizing Model Behavior Under Synthetic Data Training: An Empirical Study Across Scales and Mixing Ratios

Y. Du, G. Wu, G. Tang, W. Wang, Q. Fan

TL;DR

This work provides a controlled empirical map of how the proportion of synthetic data affects language model training across model scales, tasks, and iteration horizons. By training 410M to 12B Pythia models on 0% to 50% synthetic data for up to three iterations across five tasks, the authors quantify changes in performance, calibration, and output characteristics, and use them to establish a safe operating zone and scale-aware synthetic data budgets. Key findings: performance remains stable up to about 20% synthetic data, calibration degrades before accuracy, larger models tolerate more synthetic content, and task type strongly modulates degradation. The results support current best practices that use modest amounts of synthetic data and yield actionable guidance for practitioners, including calibration-based monitoring and task-aware budgeting. Larger-scale models and longer-horizon training are left for future research.

Abstract

Synthetic data generated by large language models has become integral to modern NLP training pipelines, from bootstrapping reasoning capabilities to augmenting instruction-following datasets. While recent work demonstrates successful applications maintaining high external data ratios, systematic understanding of how synthetic data proportion affects model behavior across different scales remains limited. This paper presents a controlled empirical study examining model performance, calibration, and output characteristics when trained on varying synthetic-to-external data ratios. Using the Pythia model suite (410M to 12B parameters) across five diverse tasks, we evaluate models after one to three training iterations with synthetic data proportions ranging from 0% to 50%. Our key findings include: models maintain stable performance with up to 20% synthetic data, but degradation accelerates beyond 30%; larger models (6.9B to 12B) show greater robustness to synthetic data than smaller models (410M to 1.4B); calibration degradation precedes accuracy loss, providing an early warning signal; and task characteristics matter, with reasoning tasks degrading faster than retrieval tasks under synthetic data training. Importantly, we find that current best practices, such as those employed in STaR and Self-Instruct systems that maintain greater than 80% external data, operate well within safe regimes identified by our experiments. We provide practical guidance for practitioners on synthetic data budgets based on model scale and task requirements, alongside detailed comparison with concurrent work including Shumailov et al.'s model collapse findings.
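
To make the notion of a synthetic-to-external mixing ratio concrete, the sketch below builds a training set with a fixed synthetic fraction. It is only an illustration under stated assumptions: the function name `mix_datasets`, its parameters, and the sample-without-replacement strategy are hypothetical and are not taken from the paper's pipeline.

```python
import random


def mix_datasets(external, synthetic, synthetic_fraction, total, seed=0):
    """Return `total` examples with the requested share of synthetic data.

    Hypothetical helper for illustration; the paper's actual mixing
    procedure is not described in this section.
    """
    if not 0.0 <= synthetic_fraction <= 1.0:
        raise ValueError("synthetic_fraction must be in [0, 1]")

    rng = random.Random(seed)
    n_synthetic = round(total * synthetic_fraction)
    n_external = total - n_synthetic

    if n_synthetic > len(synthetic) or n_external > len(external):
        raise ValueError("not enough examples to sample without replacement")

    # Sample each source without replacement, then shuffle the combined set
    # so synthetic and external examples are interleaved.
    mixed = rng.sample(synthetic, n_synthetic) + rng.sample(external, n_external)
    rng.shuffle(mixed)
    return mixed


# Toy data: each example is tagged with its source.
external_pool = [("external", i) for i in range(100)]
synthetic_pool = [("synthetic", i) for i in range(100)]

# A 20% synthetic mix, the upper end of the stable regime reported above.
train_set = mix_datasets(external_pool, synthetic_pool, synthetic_fraction=0.2, total=50)
```

Sweeping `synthetic_fraction` over values between 0.0 and 0.5 would reproduce the range of proportions the study covers, though the exact grid of ratios is not specified here.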

Paper Structure

This paper contains 33 sections, 3 equations, 1 figure, 4 tables.

Figures (1)

  • Figure 1: Calibration degrades before accuracy when training with synthetic data, providing an early warning signal. Expected Calibration Error (ECE) increases substantially starting from Iteration 0, whereas accuracy remains stable until Iteration 1 and only then begins to decline. This leading-indicator relationship enables proactive monitoring before performance visibly degrades.
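
The Expected Calibration Error referenced in Figure 1 can be sketched as follows. This is the standard equal-width binned estimator, not necessarily the exact variant used in the paper: the bin count of 10 and the function name `expected_calibration_error` are assumptions made for illustration.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: sample-weighted mean of |accuracy - confidence| per bin.

    Assumes equal-width bins over [0, 1]; n_bins=10 is a common default,
    not a value taken from the paper.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)

    # Interior bin edges 0.1, 0.2, ..., 0.9 (for n_bins=10). With right=True,
    # a confidence c lands in bin i when edge[i-1] < c <= edge[i].
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)

    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue  # empty bins contribute nothing
        gap = abs(correct[mask].mean() - confidences[mask].mean())
        ece += mask.mean() * gap  # weight by the fraction of samples in the bin
    return float(ece)


# Overconfident toy model: 90% confidence but only 50% accuracy,
# so the single occupied bin has a gap of roughly 0.4.
ece_overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 0, 1, 0])
```

Tracking this value on a held-out set after each training iteration is one way to apply the calibration-based monitoring the caption describes, since a rising ECE would appear before accuracy drops.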