On the Stability of Iterative Retraining of Generative Models on their own Data

Quentin Bertrand, Avishek Joey Bose, Alexandre Duplessis, Marco Jiralerspong, Gauthier Gidel

TL;DR

The paper tackles the risk of instability and collapse when generative models are retrained iteratively on datasets that include their own synthetic data. It develops a mixed-data likelihood framework and proves local stability and a fixed-point structure under mild regularity, plus finite-sample approximate stability with a three-term error decomposition. The Gaussian warm-up demonstrates collapse when learning solely from self-generated data, motivating the mixed-data approach, which is validated experimentally on CIFAR-10 and FFHQ-64 across diffusion-model families. Practically, the work provides principled conditions to avoid collapse in self-consuming training pipelines and informs how to balance real versus synthetic data during iterative retraining. Overall, it advances theoretical understanding and offers empirical guidance for robust iterative retraining of high-capacity generative systems.

Abstract

Deep generative models have made tremendous progress in modeling complex data, often exhibiting generation quality that surpasses a typical human's ability to discern the authenticity of samples. Undeniably, a key driver of this success is enabled by the massive amounts of web-scale data consumed by these models. Due to these models' striking performance and ease of availability, the web will inevitably be increasingly populated with synthetic content. Such a fact directly implies that future iterations of generative models will be trained on both clean and artificially generated data from past models. In this paper, we develop a framework to rigorously study the impact of training generative models on mixed datasets -- from classical training on real data to self-consuming generative models trained on purely synthetic data. We first prove the stability of iterative training under the condition that the initial generative models approximate the data distribution well enough and the proportion of clean training data (w.r.t. synthetic data) is large enough. We empirically validate our theory on both synthetic and natural images by iteratively training normalizing flows and state-of-the-art diffusion models on CIFAR10 and FFHQ.

Paper Structure

This paper contains 29 sections, 15 theorems, 61 equations, 9 figures, 1 algorithm.

Key Result

Proposition 1

(Gaussian Collapse) For all initializations of the mean $\mu_0$ and the covariance $\Sigma_0$, iteratively learning a single multivariate Gaussian solely on its generated data yields model collapse. More precisely, if $\mu_t$ and $\Sigma_t$ follow the sampling and learning steps (sampling_step, learning_step), then there exists $\al
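The collapse described in Proposition 1 can be illustrated with a small simulation. The sketch below is a one-dimensional stand-in, assuming the sampling step draws `n_samples` points from the current Gaussian and the learning step refits the sample mean and the unbiased sample standard deviation; the function name `gaussian_self_retraining` and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np


def gaussian_self_retraining(mu0, sigma0, n_samples, n_steps, seed=0):
    """Iteratively refit a 1D Gaussian using only its own samples.

    Assumed update (a sketch, not necessarily the paper's exact equations):
      sampling step: x_j ~ N(mu_t, sigma_t^2), j = 1..n_samples
      learning step: mu_{t+1} = sample mean, sigma_{t+1} = unbiased sample std
    """
    rng = np.random.default_rng(seed)
    mu, sigma = float(mu0), float(sigma0)
    history = [(mu, sigma)]
    for _ in range(n_steps):
        # Sampling step: draw synthetic data from the current model only.
        samples = rng.normal(mu, sigma, size=n_samples)
        # Learning step: refit the parameters on the synthetic data.
        mu = float(samples.mean())
        sigma = float(samples.std(ddof=1))
        history.append((mu, sigma))
    return history


# Hypothetical settings: a small sample size makes the shrinkage visible quickly.
history = gaussian_self_retraining(mu0=0.0, sigma0=1.0, n_samples=10, n_steps=1000)
print("initial std:", history[0][1])
print("final std:  ", history[-1][1])
```

In this toy setting the fitted standard deviation drifts toward zero as retraining proceeds, while the mean wanders away from its initial value. No real data ever re-enters the loop, which is the regime the proposition addresses.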

Figures (9)

  • Figure 1: Samples generated from EDM trained on the FFHQ dataset. As observed by Shumailov et al. (2023) and Alemohammad et al. (2023), iteratively retraining the model exclusively on its own generated data degrades image quality (top row). In contrast, retraining on a mix of half real and half synthetic data (middle row) yields quality similar to retraining on real data (bottom row).
  • Figure 2: FID, precision, and recall of the generative models as a function of the number of retraining iterations for multiple fractions $\lambda$ of generated data, $\mathcal{D} = \mathcal{D}_{\mathrm{real}} \cup \{ \tilde{\mathbf{x}}_i \}_{i=1}^{\lfloor \lambda \cdot n \rfloor}$, $\tilde{\mathbf{x}}_i \sim p_{\pmb{\theta}_t}$. For all models and datasets, training only on synthetic data (dashed red line with triangles) yields divergence. For the EDM models on CIFAR-$10$ (middle row), the iterative retraining is stable for all the proportions of generated data from $\lambda=0$ to $\lambda=1$. For the EDM on FFHQ-$64$ (bottom row), the iterative retraining is stable if the proportion of generated data used is small enough ($\lambda < 0.5$).
  • Figure 3: Stability vs. collapse of iterative retraining of generative models on their own data. Each model's density is displayed as a function of the number of retraining steps. The first two columns correspond to the true density and the density of a diffusion model trained on the true data. As observed by Shumailov et al. (2023) and Alemohammad et al. (2023), iteratively retraining the model exclusively on its own generated data (top row) yields a density that collapses: after $100$ iterations of retraining, almost all samples lie very near the mean of each mode. In contrast, retraining on a mixture of true and generated data (bottom row) does not yield a collapsing density.
  • Figure 4: Stability vs. collapsing of iterative retraining of generative models on their own data. Each model's density is displayed as a function of the number of retraining steps. The first two columns correspond to the true density and the density of a diffusion model trained on the true data respectively.
  • Figure 5: FID, precision, and recall of the generative models as a function of the number of retraining iterations for multiple fractions $\lambda$ of generated data, $\mathcal{D} = \mathcal{D}_{\mathrm{real}} \cup \{ \tilde{\mathbf{x}}_i \}_{i=1}^{\lfloor \lambda \cdot n \rfloor}$, $\tilde{\mathbf{x}}_i \sim p_{\pmb{\theta}_t}$. Training only on synthetic data (dashed red line with triangles) yields divergence.
  • ...and 4 more figures
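The mixed dataset used in the Figure 2 and Figure 5 captions, $\mathcal{D} = \mathcal{D}_{\mathrm{real}} \cup \{ \tilde{\mathbf{x}}_i \}_{i=1}^{\lfloor \lambda \cdot n \rfloor}$, can be sketched as follows. This is a minimal illustration that uses a one-dimensional Gaussian as a stand-in for the generative model, not the diffusion models or normalizing flows from the experiments; the helper names `build_mixed_dataset` and `retrain_gaussian_on_mixture`, and the choice of $n = 1000$ and $\lambda = 0.5$, are assumptions made for the example.

```python
import math

import numpy as np


def build_mixed_dataset(real_data, sample_model, lam):
    """Return D_real plus floor(lam * n) synthetic samples, with n = len(real_data).

    `sample_model(k)` is a hypothetical callable that returns k samples
    from the current model p_{theta_t}.
    """
    n_synth = math.floor(lam * len(real_data))
    synthetic = sample_model(n_synth)
    return np.concatenate([real_data, synthetic])


def retrain_gaussian_on_mixture(real_data, lam, n_steps, seed=0):
    """Iteratively refit a 1D Gaussian on real data plus its own samples."""
    rng = np.random.default_rng(seed)
    # Initial model: fitted on real data only.
    mu, sigma = real_data.mean(), real_data.std(ddof=1)
    for _ in range(n_steps):
        data = build_mixed_dataset(
            real_data, lambda k: rng.normal(mu, sigma, size=k), lam
        )
        # Refit on the mixture of real and freshly generated data.
        mu, sigma = data.mean(), data.std(ddof=1)
    return float(mu), float(sigma)


# Toy "real" dataset, assumed standard normal for illustration.
real = np.random.default_rng(1).normal(0.0, 1.0, size=1000)
mu_mixed, sigma_mixed = retrain_gaussian_on_mixture(real, lam=0.5, n_steps=200)
print("real std: ", real.std(ddof=1))
print("mixed std:", sigma_mixed)
```

Because the real data is re-injected at every iteration, the fitted parameters in this toy example stay close to those of the real dataset rather than drifting, which mirrors the qualitative behavior the captions report for small enough $\lambda$.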

Theorems & Definitions (30)

  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Proposition 4
  • Theorem 1
  • Theorem 2
  • Proposition 4
  • Lemma A.1
  • proof
  • proof
  • ...and 20 more