Demystifying Synthetic Data in LLM Pre-training: A Systematic Study of Scaling Laws, Benefits, and Pitfalls

Feiyang Kang, Newsha Ardalani, Michael Kuchnik, Youssef Emad, Mostafa Elhoushi, Shubhabrata Sengupta, Shang-Wen Li, Ramya Raghavendra, Ruoxi Jia, Carole-Jean Wu

TL;DR

The paper examines the role of synthetic data in foundational LLM pre-training through a large-scale, unified empirical study. It compares natural web data with two synthetic paradigms (web rephrasing and textbook-style generation) and analyzes mixtures under scaling laws to assess data-budget and model-size effects. The results show that mixing around 30% rephrased synthetic data with natural data can substantially accelerate pre-training, while training on synthetic data alone often underperforms, and textbook-style data can induce degradation or collapse-like patterns depending on the regime. The study provides practical guidance for deploying synthetic data, highlights that its benefits are conditional, and emphasizes the need for further validation at frontier scales and with dynamic mixing strategies to maximize downstream capabilities.

Abstract

Training data plays a crucial role in Large Language Model (LLM) scaling, yet high-quality data is in limited supply. Synthetic data techniques offer a potential path toward sidestepping these limitations. We conduct a large-scale empirical investigation (>1000 LLMs with >100k GPU hours) using a unified protocol and scaling laws, comparing natural web data, diverse synthetic types (rephrased text, generated textbooks), and mixtures of natural and synthetic data. Specifically, we found that pre-training on rephrased synthetic data alone is not faster than pre-training on natural web texts, while pre-training on 1/3 rephrased synthetic data mixed with 2/3 natural web texts can speed up pre-training by 5-10x (to reach the same validation loss) at larger data budgets. Pre-training on textbook-style synthetic data alone results in notably higher loss on many downstream domains, especially at small data budgets. "Good" ratios of synthetic data in training data mixtures depend on the model size and data budget, empirically converging to ~30% for rephrased synthetic data. Larger generator models do not necessarily yield better pre-training data than ~8B-param models. These results contribute mixed evidence on "model collapse" during large-scale single-round (n=1) model training on synthetic data: training on rephrased synthetic data shows no degradation in performance at foreseeable scales, whereas training on mixtures of textbook-style pure-generated synthetic data shows patterns predicted by "model collapse". Our work demystifies synthetic data in pre-training, validates its conditional benefits, and offers practical guidance.
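
The "5-10x speed-up (to reach the same validation loss)" can be read as a ratio of token budgets: how many tokens the natural-data baseline needs to hit a target loss, divided by how many the mixture needs. The sketch below illustrates that reading only. It assumes a power-law data scaling curve of the form L(D) = E + B * D^(-beta), which is a common choice but is not confirmed here as the paper's formula, and every coefficient in it is made up for illustration rather than taken from the paper.

```python
def loss_at(tokens, E, B, beta):
    """Assumed curve: validation loss after training on `tokens` tokens."""
    return E + B * tokens ** (-beta)


def tokens_to_reach(target_loss, E, B, beta):
    """Invert the assumed curve to get the token count that hits target_loss."""
    if target_loss <= E:
        raise ValueError("target loss is at or below the irreducible loss E")
    return (B / (target_loss - E)) ** (1.0 / beta)


def speedup(target_loss, baseline, candidate):
    """Tokens needed by the baseline divided by tokens needed by the candidate."""
    return (tokens_to_reach(target_loss, **baseline)
            / tokens_to_reach(target_loss, **candidate))


# Hypothetical coefficients, NOT fitted values from the paper.
natural = {"E": 2.00, "B": 400.0, "beta": 0.3}
mixture = {"E": 1.95, "B": 400.0, "beta": 0.3}

if __name__ == "__main__":
    print(f"speed-up at loss 2.5: {speedup(2.5, natural, mixture):.2f}x")
```

Under this reading, a value above 1 means the mixture reaches the target loss with fewer tokens. Because the ratio depends on the chosen target loss, it is consistent with the abstract's note that the speed-up appears at larger data budgets.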

Paper Structure

This paper contains 60 sections, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Data Scaling. Left: Validation of the data scaling formula. Predictions for 200B tokens (fitted using up to 100B tokens) achieve an RMABE of 0.41%. Solid dots display actual loss values, while the fitted curves show predicted loss. Validation datapoints are marked with diamonds. Right: Extrapolated data scaling performance for 1B-parameter models across various data mixtures.
  • Figure 2: Model Scaling. Left: Validation of the model scaling formula. Predictions for 3B-parameter models (fitted using up to 2B-parameter models) achieve an RMABE of 0.30% on validation datapoints, which are marked with diamonds. Solid dots display actual loss values, while the fitted curves show predicted loss. Right: Extrapolated model scaling performance for training on 50B tokens across various data mixtures.
  • Figure 3: Estimated irreducible loss ($E$) for different data mixtures. Lower values are better.
  • Figure 4: Best-found mixture ratios (percentage of synthetic data mixed with CommonCrawl) from grid search for HQ (Left), QA (Middle), and TXBK (Right) synthetic data types across different model sizes and data budgets. Best-found ratios are all below $50\%$ and appear to converge to $\sim 30\%$.
  • Figure 5: Generator model capability ablation. Compares validation loss of 1B-parameter models trained for 5B tokens using mixtures of HQ/QA rephrased data from Llama3-3B/8B/70B generators with CommonCrawl. The percentage of synthetic data in these mixtures was varied across seven exponentially spaced points from 0.5% to 20%.
  • ...and 2 more figures
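
The Figure 5 caption describes "seven exponentially spaced points from 0.5% to 20%" without listing them. One plausible reading is a geometric grid, where each point is a constant multiple of the previous one. The sketch below generates such a grid using only the standard library. The function name `geometric_grid` is hypothetical, and the exact values the authors used may differ (for example through rounding).

```python
def geometric_grid(start, stop, num):
    """Return `num` points from start to stop with a constant ratio between neighbors."""
    if num < 2:
        raise ValueError("need at least two points")
    if start <= 0 or stop <= 0:
        raise ValueError("geometric spacing requires positive endpoints")
    ratio = (stop / start) ** (1.0 / (num - 1))
    return [start * ratio ** i for i in range(num)]


# Assumed reading of the caption: percentages of synthetic data in the mixture.
synthetic_pct = geometric_grid(0.5, 20.0, 7)

if __name__ == "__main__":
    print([round(p, 2) for p in synthetic_pct])
```

With NumPy available, `numpy.geomspace(0.5, 20.0, 7)` produces the same kind of grid. Spacing the points geometrically puts more of them at small synthetic fractions, which is where a loss curve over mixture ratio would be expected to change fastest.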