On the Stability of Iterative Retraining of Generative Models on their own Data
Quentin Bertrand, Avishek Joey Bose, Alexandre Duplessis, Marco Jiralerspong, Gauthier Gidel
TL;DR
The paper tackles the risk of instability and collapse when generative models are retrained iteratively on datasets that include their own synthetic output. It develops a mixed-data likelihood framework and proves that iterative retraining is locally stable around a fixed point, provided the initial model approximates the data distribution well enough and the share of real data is large enough. It also gives a finite-sample approximate-stability result with a three-term error decomposition. A Gaussian warm-up shows collapse when a model learns solely from self-generated data, which motivates the mixed-data approach. The theory is validated experimentally on CIFAR-10 and FFHQ-64 across diffusion-model families. In practice, the work provides principled conditions for avoiding collapse in self-consuming training pipelines and informs how to balance real against synthetic data during iterative retraining.
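The Gaussian warm-up mentioned above can be illustrated with a small simulation. This is a minimal sketch, not the paper's exact setup: the sample size, number of iterations, the plain maximum-likelihood refit, and the names `retrain_gaussian`, `lam`, and `use_real` are all assumptions made for illustration.

```python
import numpy as np


def retrain_gaussian(real, n_iters, lam, use_real, rng):
    """Iteratively refit a 1-D Gaussian and return the fitted std per iteration.

    real     : 1-D array of real samples (fixed across iterations)
    lam      : synthetic-to-real ratio (hypothetical name for this sketch)
    use_real : if False, each refit sees only samples from the previous model
    """
    n = real.size
    mu, sigma = real.mean(), real.std()  # iteration 0: fit on real data only
    sigmas = [sigma]
    for _ in range(n_iters):
        if use_real:
            # Mixed dataset: all real samples plus lam * n synthetic samples.
            synth = rng.normal(mu, sigma, size=int(lam * n))
            train = np.concatenate([real, synth])
        else:
            # Fully self-consuming loop: only samples from the previous model.
            train = rng.normal(mu, sigma, size=n)
        mu, sigma = train.mean(), train.std()  # maximum-likelihood refit
        sigmas.append(sigma)
    return np.array(sigmas)


rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=100)

collapse = retrain_gaussian(real, n_iters=2000, lam=1.0, use_real=False, rng=rng)
mixed = retrain_gaussian(real, n_iters=2000, lam=1.0, use_real=True, rng=rng)

print(f"std of real data:             {real.std():.3f}")
print(f"final std, synthetic only:    {collapse[-1]:.3e}")
print(f"final std, real + synthetic:  {mixed[-1]:.3f}")
```

Under these assumptions, the synthetic-only run should see its fitted standard deviation shrink toward zero as estimation errors compound across generations, while the mixed run stays close to the standard deviation of the real data.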
Abstract
Deep generative models have made tremendous progress in modeling complex data, often exhibiting generation quality that surpasses a typical human's ability to discern the authenticity of samples. Undeniably, a key driver of this success is the massive amount of web-scale data these models consume. Due to these models' striking performance and ease of availability, the web will inevitably be increasingly populated with synthetic content. This directly implies that future iterations of generative models will be trained on both clean and artificially generated data from past models. In this paper, we develop a framework to rigorously study the impact of training generative models on mixed datasets, ranging from classical training on real data to self-consuming generative models trained on purely synthetic data. We prove the stability of iterative training under the condition that the initial generative models approximate the data distribution well enough and the proportion of clean training data (w.r.t. synthetic data) is large enough. We empirically validate our theory on both synthetic and natural images by iteratively training normalizing flows and state-of-the-art diffusion models on CIFAR-10 and FFHQ.
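One way to write the mixed-data retraining step described in the abstract is sketched below. The notation is assumed here for illustration and does not appear in the text above: $\theta_t$ denotes the model parameters after $t$ rounds of retraining, $p_{\mathrm{data}}$ the real data distribution, $p_{\theta_t}$ the current model, and $\lambda \ge 0$ the ratio of synthetic to real data.

```latex
\theta_{t+1} \in \arg\max_{\theta}\;
  \mathbb{E}_{x \sim p_{\mathrm{data}}}\bigl[\log p_{\theta}(x)\bigr]
  \;+\; \lambda\, \mathbb{E}_{\tilde{x} \sim p_{\theta_t}}\bigl[\log p_{\theta}(\tilde{x})\bigr]
```

In this reading, $\lambda = 0$ recovers classical training on real data, while dropping the first term corresponds to the fully self-consuming case. The stability claim then says that the iterates $\theta_t$ remain near the optimum when $\theta_0$ is close enough to it and $\lambda$ is small enough.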
