Self-Improving Diffusion Models with Synthetic Data
Sina Alemohammad, Ahmed Imtiaz Humayun, Shruti Agarwal, John Collomosse, Richard Baraniuk
TL;DR
Self-IMproving diffusion models with Synthetic data (SIMS) tackles Model Autophagy Disorder (MAD) by combining a base diffusion score learned from real data with an auxiliary score learned from self-generated synthetic data, forming a negative-guidance mechanism that steers generation toward the real data distribution. The core idea is to extrapolate from the auxiliary score $\mathbf{s}_{\theta_s}$ past the base score $\mathbf{s}_{\theta_r}$ via the guidance form $\mathbf{s}_\theta(\mathbf{x}_t,t)=(1+\omega)\mathbf{s}_{\theta_r}(\mathbf{x}_t,t)-\omega\mathbf{s}_{\theta_s}(\mathbf{x}_t,t)$, with the hyperparameter $n_s$ and the training budget $\mathcal{B}$ governing the auxiliary model's influence. Empirically, SIMS achieves state-of-the-art FID on CIFAR-10 and ImageNet-64 while remaining competitive on FFHQ-64 and ImageNet-512. It also demonstrates MAD prevention in synthetic augmentation loops, as well as the ability to shift the synthetic data distribution toward a chosen in-domain target distribution for fairness. This work introduces a prophylactic, self-contained framework that enables iterative training on self-generated data without MAD, potentially informing safer deployment of synthetic data in large-scale diffusion models and beyond.
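To make the guidance form concrete, the following is a minimal sketch in plain Python, not the authors' implementation. The names `sims_guided_score` and `gaussian_score` are hypothetical, and the two unit-variance Gaussian scores are toy stand-ins (an assumption for illustration) for the learned networks $\mathbf{s}_{\theta_r}$ and $\mathbf{s}_{\theta_s}$.

```python
from typing import Callable, List, Sequence

# A score function maps (x_t, t) to a vector with the same length as x_t.
Score = Callable[[Sequence[float], float], List[float]]


def sims_guided_score(score_real: Score, score_synth: Score, omega: float) -> Score:
    """Combine two scores as (1 + omega) * s_real - omega * s_synth."""
    def guided(x_t: Sequence[float], t: float) -> List[float]:
        s_r = score_real(x_t, t)
        s_s = score_synth(x_t, t)
        return [(1.0 + omega) * r - omega * s for r, s in zip(s_r, s_s)]
    return guided


def gaussian_score(mean: float) -> Score:
    """Toy stand-in (assumption): score of a unit-variance Gaussian,
    grad_x log p(x) = -(x - mean). The time argument t is ignored."""
    def score(x_t: Sequence[float], t: float) -> List[float]:
        return [-(x - mean) for x in x_t]
    return score


# Hypothetical setup: base model centered at 0.0, auxiliary model drifted to 0.5.
base = gaussian_score(0.0)
aux = gaussian_score(0.5)
guided = sims_guided_score(base, aux, omega=1.0)

# Algebraically, the guided score here is -(x + omega * 0.5): a Gaussian score
# whose mean sits at -0.5, on the opposite side of the base mean from the
# auxiliary mean.
result = guided([1.0, -2.0], 0.0)
print(result)
```

In this toy setting, setting `omega = 0` recovers the base score exactly, while larger `omega` pushes the implied mean further away from the auxiliary (synthetic) model, which is the negative-guidance behavior described above.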
Abstract
The artificial intelligence (AI) world is running out of real data for training increasingly large generative models, resulting in accelerating pressure to train on synthetic data. Unfortunately, training new generative models with synthetic data from current or past generation models creates an autophagous (self-consuming) loop that degrades the quality and/or diversity of the synthetic data in what has been termed model autophagy disorder (MAD) and model collapse. Current thinking around model autophagy recommends that synthetic data is to be avoided for model training lest the system deteriorate into MADness. In this paper, we take a different tack that treats synthetic data differently from real data. Self-IMproving diffusion models with Synthetic data (SIMS) is a new training concept for diffusion models that uses self-synthesized data to provide negative guidance during the generation process to steer a model's generative process away from the non-ideal synthetic data manifold and towards the real data distribution. We demonstrate that SIMS is capable of self-improvement; it establishes new records based on the Fréchet inception distance (FID) metric for CIFAR-10 and ImageNet-64 generation and achieves competitive results on FFHQ-64 and ImageNet-512. Moreover, SIMS is, to the best of our knowledge, the first prophylactic generative AI algorithm that can be iteratively trained on self-generated synthetic data without going MAD. As a bonus, SIMS can adjust a diffusion model's synthetic data distribution to match any desired in-domain target distribution to help mitigate biases and ensure fairness.
