Beyond Model Collapse: Scaling Up with Synthesized Data Requires Verification

Yunzhen Feng, Elvis Dohmatob, Pu Yang, Francois Charton, Julia Kempe

TL;DR

The paper tackles model collapse when scaling training with synthetic data and proposes verifier-based data selection as a remedy. It provides a theoretical framework using Gaussian mixtures and pruning to show a sharp phase transition in downstream performance, governed by a breakdown point tied to verifier quality, and introduces a practical proxy p_* to predict outcomes. The authors validate the theory with simulations and two large-scale experiments (transformers on eigenvalue prediction and Llama-2-based summarization on XLSUM), demonstrating that appropriate verification can prevent collapse and even surpass the original generator under certain conditions. The work highlights verification as a scalable, data-efficient mechanism to harness synthesized data for high-performance models, while acknowledging limitations and the need for broader data-curation strategies in real-world settings.

Abstract

Large Language Models (LLMs) are increasingly trained on data generated by other LLMs, either because generated text and images become part of the pre-training corpus, or because synthesized data is used as a replacement for expensive human annotation. This raises concerns about model collapse, a drop in model performance when training sets include generated data. Considering that it is easier for both humans and machines to distinguish good examples from bad ones than to generate high-quality samples, we investigate the use of verification on synthesized data to prevent model collapse. We provide a theoretical characterization using Gaussian mixtures, linear classifiers, and linear verifiers to derive conditions with measurable proxies to assess whether the verifier can effectively select synthesized data that leads to optimal performance. We experiment with two practical tasks, computing matrix eigenvalues with transformers and news summarization with LLMs, both of which exhibit model collapse when trained on generated data, and show that verifiers, even imperfect ones, can indeed be harnessed to prevent model collapse and that our proposed proxy measure strongly correlates with performance.

Paper Structure (57 sections, 8 theorems, 71 equations, 6 figures, 3 tables)

Key Result

Theorem 4.2

Let Assumption ass:independent-selection be in order. Fix $p$, $\phi,\psi$ and define the breakdown point $p_\star \in (0,1)$ by $p_\star := 1/(1+\psi/\phi)$. For the family of data distributions obeying Condition cond:main (including the Gaussian mixture), for a downstream model $\widehat{f}_N$ tra
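
The breakdown-point formula above can be explored numerically. The following is a minimal sketch, not the authors' code. It assumes (the excerpt does not define the symbols) that $\phi$ is the probability that the verifier keeps a correctly labeled synthesized example and $\psi$ is the probability that it keeps an incorrectly labeled one, with $p$ the generator's label error rate. The function names are hypothetical.

```python
def breakdown_point(phi: float, psi: float) -> float:
    """Return p_star = 1 / (1 + psi / phi), the formula in the theorem statement.

    Assumed reading (not stated in the excerpt): phi is the keep-rate for
    correctly labeled examples, psi the keep-rate for mislabeled ones.
    """
    if not (0.0 < phi <= 1.0 and 0.0 <= psi <= 1.0):
        raise ValueError("phi must lie in (0, 1] and psi in [0, 1]")
    return 1.0 / (1.0 + psi / phi)


def error_rate_after_pruning(p: float, phi: float, psi: float) -> float:
    """Fraction of mislabeled examples among those the verifier keeps.

    Under the assumed reading, a mislabeled example (probability p) survives
    with probability psi, and a correct one (probability 1 - p) with phi.
    """
    kept_wrong = p * psi
    kept_right = (1.0 - p) * phi
    return kept_wrong / (kept_wrong + kept_right)


if __name__ == "__main__":
    phi, psi = 0.9, 0.3  # illustrative verifier, chosen arbitrarily
    p_star = breakdown_point(phi, psi)
    for p in (0.6, p_star, 0.8):
        rate = error_rate_after_pruning(p, phi, psi)
        print(f"p={p:.2f}  p_star={p_star:.2f}  error after pruning={rate:.3f}")
```

Under this reading, the selected data has a majority of correct labels exactly when $p < p_\star$, which is one way to see why the threshold takes this form. A verifier that selects at random ($\phi = \psi$) gives $p_\star = 0.5$.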

Figures (6)

  • Figure 1: Illustrative figures for our proposal (a) and for the theoretical and simulation settings (b).
  • Figure 2: Empirical confirmation of Theorem 4.2. Comparing the breakdown points of different generators and pruners of different strengths. Synthesized data is generated from a linear model $w_{gen}$ with classification error rate $p=\theta_{gen}/\pi \in [0,1]$. The data is pruned with another linear model $w_{prune}$ which has classification error $\theta_{prune}/\pi$. Broken lines correspond to the prediction of Theorem 4.2, while solid points correspond to experiments. Notice the sharp phase transitions where the model suddenly switches from perfect accuracy to worse-than-chance, as the theorem predicts.
  • Figure 3: Simulations with Gaussian mixtures. (Top row) Relative error (accuracy relative to optimal accuracy) scaling as a function of the number of selected data, $n'$, used to train the model. $\tau=0.15, N_1=10^6$. The Bayes optimal classifier achieves approximately 94% accuracy on this distribution. (Bottom row) $p_*$ values for all settings.
  • Figure 4: Transformers computing eigenvalues. Correlation between accuracy with $1\%$ tolerance, $p_*$, and the number of synthesized data. Model collapse is observed without verification ($p_* = 0.5$), while higher values of $p_*$ result in improved performance. Results are averaged over 5 seeds.
  • Figure 5: News summarization with LLMs. (Top row) The three figures from left to right represent models trained on 12.5%, 25%, and 50% of the selected data, respectively. Each figure includes four curves illustrating different training scenarios: (1) selection with oracle, (2) selection with Llama-3 as a weak supervision, (3) self-selection, and (4) random selection. Additionally, two horizontal lines are included for comparison: one representing the generator model and the other representing a model trained with 100% of the data with original labels. (Bottom row) Computed values of $p_*$ for the verifiers for corresponding proportions of selected data. Training with data selected by a verifier with higher $p_*$ achieves better performance.
  • ...and 1 more figure
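
The relation $p=\theta_{gen}/\pi$ quoted in the Figure 2 caption can be checked with a small Monte Carlo experiment. The sketch below assumes a simplified setting that the caption does not spell out: features drawn from a 2D standard Gaussian, true labels given by the sign of a linear model, and a generator whose weight vector is rotated by an angle $\theta$ from the true one. The function name and sample size are illustrative choices.

```python
import math
import random


def disagreement_rate(theta: float, n: int = 100_000, seed: int = 0) -> float:
    """Estimate how often two linear classifiers at angle theta disagree.

    Assumption: x ~ N(0, I_2). For a rotation-invariant distribution the
    exact disagreement probability is theta / pi.
    """
    rng = random.Random(seed)
    w_true = (1.0, 0.0)
    w_gen = (math.cos(theta), math.sin(theta))  # w_true rotated by theta
    disagreements = 0
    for _ in range(n):
        x0, x1 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        y_true = (w_true[0] * x0 + w_true[1] * x1) > 0
        y_gen = (w_gen[0] * x0 + w_gen[1] * x1) > 0
        disagreements += y_true != y_gen
    return disagreements / n


if __name__ == "__main__":
    for theta in (math.pi / 8, math.pi / 4, math.pi / 2):
        est = disagreement_rate(theta)
        print(f"theta={theta:.3f}  estimated p={est:.4f}  theta/pi={theta / math.pi:.4f}")
```

The same construction applies to the pruner: a linear verifier at angle $\theta_{prune}$ from the true direction has error $\theta_{prune}/\pi$ under the same assumptions.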

Theorems & Definitions (12)

  • Theorem 4.2: Simplified version of the main theorem (thm:main)
  • Remark 4.3
  • Remark E.2
  • Theorem E.3
  • Corollary E.4
  • Remark E.5
  • Proposition F.1
  • proof
  • Proposition F.2
  • Proposition F.3
  • ...and 2 more