How to Synthesize Text Data without Model Collapse?
Xuekai Zhu, Daixuan Cheng, Hengli Li, Kaiyan Zhang, Ermo Hua, Xingtai Lv, Ning Ding, Zhouhan Lin, Zilong Zheng, Bowen Zhou
TL;DR
The paper studies how synthetic data affects language-model training and shows that non-iterative mixing of synthetic and human data degrades performance, which it attributes to a distributional gap and over-concentrated n-gram features in the synthetic data. It introduces token-level editing (ToEdit), a prior-guided approach that edits human-produced text to obtain semi-synthetic data, preserving the source distribution while improving data quality. The authors prove that the test error under token-level editing has a finite upper bound, which prevents model collapse. They validate ToEdit on pre-training from scratch, continual pre-training, and supervised fine-tuning across multiple models and domains, reporting consistent improvements without increasing data size. The work offers a practical data-synthesis strategy that mitigates collapse risk, with implications for how synthetic data is used to train future GPT-n-style models.
Abstract
Model collapse refers to the gradual decline in performance that results from iterative training on self-generated data. With the proliferation of AI models, synthetic data will fundamentally reshape the web data ecosystem. Future GPT-$\{n\}$ models will inevitably be trained on a blend of synthetic and human-produced data. In this paper, we focus on two questions: what is the impact of synthetic data on language model training, and how can data be synthesized without model collapse? We first pre-train language models with different proportions of synthetic data, revealing a negative correlation between the proportion of synthetic data and model performance. We then conduct a statistical analysis of synthetic data, which uncovers a distributional shift phenomenon and an over-concentration of n-gram features. Inspired by these findings, we propose token editing on human-produced data to obtain semi-synthetic data. As a proof of concept, we theoretically demonstrate that token-level editing can prevent model collapse, as the test error is constrained by a finite upper bound. We conduct extensive experiments on pre-training from scratch, continual pre-training, and supervised fine-tuning. The results support our theoretical analysis and show that token-level editing improves model performance.
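To make the idea of token editing concrete, the following is a minimal sketch of one possible reading, not the authors' implementation. It assumes that a prior language model has already assigned a probability to each token of a human-written sequence, and that tokens whose probability reaches a threshold are resampled while all other tokens are kept. The function name `edit_tokens`, the `threshold` value, the `resample` callback, and the toy vocabulary are all hypothetical and introduced only for illustration.

```python
import random
from typing import Callable, List, Sequence


def edit_tokens(
    tokens: Sequence[str],
    probs: Sequence[float],
    resample: Callable[[int, Sequence[str]], str],
    threshold: float = 0.99,  # assumed value, chosen only for this example
) -> List[str]:
    """Return a semi-synthetic copy of `tokens`.

    Assumption: `probs[i]` is the probability a prior model assigns to
    `tokens[i]`. Tokens with probability >= `threshold` are replaced by
    `resample(i, tokens)`; every other token is copied unchanged, so most
    of the human-written sequence is preserved.
    """
    if len(tokens) != len(probs):
        raise ValueError("tokens and probs must have the same length")
    edited = []
    for i, (tok, p) in enumerate(zip(tokens, probs)):
        if p >= threshold:
            edited.append(resample(i, tokens))  # edit this position
        else:
            edited.append(tok)  # keep the human-written token
    return edited


# Toy usage with made-up probabilities and a stand-in sampler.
rng = random.Random(0)
toy_vocab = ["a", "dog", "stood", "near", "this", "rug"]


def toy_resample(position: int, context: Sequence[str]) -> str:
    # Stand-in for sampling from a prior model's distribution at `position`.
    return rng.choice(toy_vocab)


tokens = ["the", "cat", "sat", "on", "the", "mat"]
probs = [0.50, 0.10, 0.995, 0.30, 0.999, 0.20]
edited = edit_tokens(tokens, probs, toy_resample)
print(edited)
```

In this toy run only positions 2 and 4 are eligible for editing, so the output has the same length as the input and the remaining tokens are untouched, which matches the abstract's point that the data is semi-synthetic rather than fully generated.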
