How to Synthesize Text Data without Model Collapse?

Xuekai Zhu, Daixuan Cheng, Hengli Li, Kaiyan Zhang, Ermo Hua, Xingtai Lv, Ning Ding, Zhouhan Lin, Zilong Zheng, Bowen Zhou

TL;DR

The paper studies how synthetic data affects language-model training and shows that non-iterative mixing of synthetic and human data degrades performance due to distributional gaps and n-gram over-concentration. It introduces token-level editing (ToEdit), a prior-guided semi-synthetic data approach that preserves the source distribution while enhancing data quality, and proves a finite upper bound on test error to prevent model collapse. The authors validate ToEdit across pre-training from scratch, continual pre-training, and supervised fine-tuning on multiple models and domains, demonstrating consistent improvements without increasing data size. The work provides a practical data-synthesis strategy that mitigates collapse risk, with broad implications for the use of synthetic data in next-generation GPT-n-style models.

Abstract

Model collapse in synthetic data refers to the gradual decline in performance caused by iterative training on self-generated data. With the proliferation of AI models, synthetic data will fundamentally reshape the web data ecosystem. Future GPT-$n$ models will inevitably be trained on a blend of synthetic and human-produced data. In this paper, we focus on two questions: what is the impact of synthetic data on language model training, and how can data be synthesized without model collapse? We first pre-train language models across different proportions of synthetic data, revealing a negative correlation between the proportion of synthetic data and model performance. We further conduct statistical analysis on synthetic data to uncover a distributional shift phenomenon and an over-concentration of n-gram features. Inspired by the above findings, we propose token editing on human-produced data to obtain semi-synthetic data. As a proof of concept, we theoretically demonstrate that token-level editing can prevent model collapse, as the test error is constrained by a finite upper bound. We conduct extensive experiments on pre-training from scratch, continual pre-training, and supervised fine-tuning. The results validate our theoretical proof that token-level editing improves model performance.
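The token-editing idea described above can be illustrated with a minimal sketch. It assumes a specific editing rule that the abstract does not spell out: a token is resampled from a prior model only when that prior already assigns it a probability at or above a threshold, and is otherwise kept verbatim. The names `edit_tokens`, `toy_prior`, and the `threshold` value are hypothetical and chosen for illustration; the toy prior stands in for a real language model.

```python
import random
from typing import Callable, Dict, List, Optional, Sequence

# A "prior" maps a token prefix to a next-token probability distribution.
Prior = Callable[[Sequence[str]], Dict[str, float]]


def edit_tokens(
    tokens: Sequence[str],
    prior: Prior,
    threshold: float = 0.99,  # assumed value, not taken from the source
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Resample only the tokens the prior finds highly predictable (assumed rule)."""
    rng = rng or random.Random(0)
    edited: List[str] = []
    for i, tok in enumerate(tokens):
        # Condition on the ORIGINAL prefix, so one pass over the text suffices.
        dist = prior(tokens[:i])
        if dist.get(tok, 0.0) >= threshold:
            candidates, weights = zip(*dist.items())
            edited.append(rng.choices(candidates, weights=weights, k=1)[0])
        else:
            # Low-probability (long-tail) tokens are kept untouched.
            edited.append(tok)
    return edited


def toy_prior(prefix: Sequence[str]) -> Dict[str, float]:
    """Hypothetical stand-in for a language model's next-token distribution."""
    if prefix and prefix[-1] == "the":
        return {"cat": 0.995, "dog": 0.005}
    return {"the": 0.25, "cat": 0.25, "sat": 0.25, "dog": 0.25}


if __name__ == "__main__":
    print(edit_tokens(["the", "cat", "sat"], toy_prior))
```

Because most tokens fall below the threshold and are copied unchanged, the edited text has the same length as the source and stays close to its distribution, which is the property the abstract attributes to semi-synthetic data.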

Paper Structure

This paper contains 49 sections, 3 theorems, 34 equations, 14 figures, 18 tables, 1 algorithm.

Key Result

Theorem 1

In the data editing setting, $\forall n \geq 1$, the fitted linear parameters $\hat{w}_{n+1}$ can be expressed explicitly in terms of the following quantities: $w^*$, the true parameter; $X$, the original design matrix; $E_i$, the extra noise added at the $i$-th iteration; and $M_i$, an idempotent diagonal matrix that defines the editing operation.
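A small simulation can make the role of $M_i$ concrete. The sketch below is not the paper's proof. It assumes a particular recursion, $\tilde{Y}_{i+1} = X\hat{w}_i + M_i E_{i+1}$ with $\hat{w}_i$ fitted by least squares, under which the parameters unroll to $w^*$ plus a pseudo-inverse projection of $E_1 + \sum_{i=1}^{n} M_i E_{i+1}$. It also assumes that the fraction of edited rows shrinks geometrically with rate `eta`, and it measures error as $\|\hat{w} - w^*\|^2$. All function and parameter names are hypothetical.

```python
import numpy as np


def simulate(T=200, d=10, n_iters=20, sigma=1.0, eta=0.5, edit=True, seed=0):
    """Return ||w_hat - w*||^2 after each refit under the assumed recursion."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(T, d))          # fixed design matrix
    w_star = rng.normal(size=d)          # true parameter
    X_pinv = np.linalg.pinv(X)           # least-squares solver for every refit

    # Generation 0: fit on real labels y = X w* + E_1.
    y = X @ w_star + sigma * rng.normal(size=T)
    w_hat = X_pinv @ y
    errors = [float(np.sum((w_hat - w_star) ** 2))]

    for i in range(1, n_iters + 1):
        noise = sigma * rng.normal(size=T)
        if edit:
            # M_i as a 0/1 vector: an idempotent diagonal mask that touches
            # only a shrinking subset of rows (assumed schedule eta**i).
            mask = np.zeros(T)
            k = int(T * eta ** i)
            mask[rng.choice(T, size=k, replace=False)] = 1.0
        else:
            # Fully synthetic baseline: every label receives fresh noise.
            mask = np.ones(T)
        y = X @ w_hat + mask * noise
        w_hat = X_pinv @ y
        errors.append(float(np.sum((w_hat - w_star) ** 2)))
    return errors


if __name__ == "__main__":
    full = np.mean([simulate(edit=False, seed=s)[-1] for s in range(10)])
    edited = np.mean([simulate(edit=True, seed=s)[-1] for s in range(10)])
    print(f"fully synthetic: {full:.3f}  edited: {edited:.3f}")
```

In this toy setting the fully synthetic baseline accumulates one full noise term per generation, while the edited variant only accumulates noise on the masked rows, so its error stays close to that of the first fit.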

Figures (14)

  • Figure 1: Model collapse of synthetic data. ① The model continuously trains on its previously generated data, leading to a gradual decline in model performance, i.e., model collapse. Starting from real data $Data_0$, the test error $E_{test}$ increases as $f_0$ undergoes iterative training on synthetic data $Data_{>0}$. ② ToEdit (ours): we use a trained model for token-level editing rather than purely synthesizing data. By leveraging $f_0$ and an operation matrix $M_i$ to edit the data, the test error is constrained within a fixed upper bound. Therefore, we can preserve the distribution coverage to avoid model collapse.
  • Figure 2: Non-iterative model collapse. Training language models from scratch on AI-synthesized data or a mixture of human and synthetic data leads to performance degradation. This degradation is negatively correlated with the proportion of synthetic data used in training. Setting: We pre-train GPT-2 Small (124M) on human data (Dolma \cite{dolma}) and synthetic data (Cosmopedia \cite{benallal2024cosmopedia}) and evaluate perplexity (PPL) on the Paloma benchmark \cite{Magnusson2023PalomaAB}. The training loss is shown in Figure \ref{fig:training_loss_synthetic_data}. Further validations on 22 subdomains and general downstream tasks are presented in Table \ref{tab:ppl_results_of_pile} and Table \ref{tab:human_vs_synthetic_downstream_tasks}, respectively.
  • Figure 3: PPL distribution of human and synthetic data estimated by Llama-3-8B. The synthetic data lacks the long tail of the human-produced data and is concentrated within the first $25\%$ of the human-produced data distribution. A. The distribution of human-produced data is sharp with a long tail, spanning a wide range from 0 to over 100. B. The synthetic data values are concentrated within a much narrower range, mostly between 0 and 14. The same trend, estimated by StableLM-3B, is shown in Figure \ref{fig:ppl_StabLM-Zephyr-3B}.
  • Figure 4: A. Pre-training results for selected synthetic data and other data mixtures on OLMo-237M. B. Embedding visualization of human-produced, synthetic, and DSIR-selected data using sentence-transformer.
  • Figure 5: U-shape token probability distribution of Dolma-sampled V6 estimated by Qwen-0.5B-Instruct \cite{qwen2}.
  • ...and 9 more figures
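The comparison in Figure 3 rests on per-sample perplexity. As a rough sketch, perplexity is the exponential of the negative mean token log-probability, and the claim that synthetic data sits in the lower part of the human distribution can be checked by counting how many synthetic samples fall at or below a quantile of the human values. The helper names and the simple index-based quantile are assumptions made for illustration; the numbers in the usage example are invented, not taken from the paper.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """PPL = exp(-mean log p) over natural-log token probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def fraction_below_quantile(
    reference: Sequence[float], values: Sequence[float], q: float = 0.25
) -> float:
    """Share of `values` at or below the q-quantile of `reference`.

    Uses a simple index-based quantile (no interpolation), which is an
    illustrative choice rather than the paper's procedure.
    """
    ref = sorted(reference)
    cutoff = ref[int(q * (len(ref) - 1))]
    return sum(v <= cutoff for v in values) / len(values)


if __name__ == "__main__":
    # Made-up PPL values: a wide human range versus a narrow synthetic one.
    human_ppl = list(range(1, 101))
    synthetic_ppl = [3, 10, 14, 30]
    print(perplexity([math.log(0.5)] * 4))
    print(fraction_below_quantile(human_ppl, synthetic_ppl))
```

A share close to 1 would indicate that nearly all synthetic samples lie in the low-perplexity region of the human data, which is the kind of concentration the caption describes.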

Theorems & Definitions (3)

  • Theorem 1
  • Theorem 2
  • Lemma 3