Self-Improving Diffusion Models with Synthetic Data

Sina Alemohammad; Ahmed Imtiaz Humayun; Shruti Agarwal; John Collomosse; Richard Baraniuk

TL;DR

Self-IMproving diffusion models with Synthetic data (SIMS) tackles Model Autophagy Disorder (MAD) by combining a base score learned from real data with an auxiliary score learned from self-generated synthetic data, forming a negative-guidance mechanism that steers generation away from the synthetic data manifold and toward the real data distribution. The core idea is to extrapolate from the base score $\mathbf{s}_{\theta_r}$ and the auxiliary score $\mathbf{s}_{\theta_s}$ via the guidance form $\mathbf{s}_\theta(\mathbf{x}_t,t)=(1+\omega)\mathbf{s}_{\theta_r}(\mathbf{x}_t,t)-\omega\mathbf{s}_{\theta_s}(\mathbf{x}_t,t)$, where $\omega$ is the guidance parameter, with hyperparameters $n_s$ and training budget $\mathcal{B}$ governing auxiliary-model influence. Empirically, SIMS achieves state-of-the-art FID on CIFAR-10 and ImageNet-64 while remaining competitive on FFHQ-64 and ImageNet-512. It also demonstrates MAD prevention in synthetic augmentation loops and the ability to shift the synthetic data distribution toward a chosen in-domain target distribution for fairness. This work introduces a prophylactic, self-contained framework that enables iterative training on self-generated data without MAD, potentially informing safer deployment of synthetic data in large-scale diffusion models and beyond.
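The guidance form above can be illustrated with a minimal sketch. This is a toy under stated assumptions, not the paper's implementation: the two score networks are replaced by closed-form scores of one-dimensional Gaussians, the time argument $t$ is dropped, and the names `gaussian_score` and `sims_score` are hypothetical.

```python
from typing import Callable

Score = Callable[[float], float]


def gaussian_score(mu: float, sigma: float) -> Score:
    """Score (derivative of the log-density) of a 1-D Gaussian N(mu, sigma^2).

    Hypothetical stand-in for a learned score network.
    """
    def score(x: float) -> float:
        return -(x - mu) / sigma**2
    return score


def sims_score(base: Score, aux: Score, omega: float) -> Score:
    """Extrapolated score (1 + omega) * base(x) - omega * aux(x)."""
    def score(x: float) -> float:
        return (1.0 + omega) * base(x) - omega * aux(x)
    return score


# Toy setup (assumed values): the base score is centred at 0.0 and the
# auxiliary score, standing in for a model that drifted on synthetic data,
# is centred at 0.5.
s_r = gaussian_score(mu=0.0, sigma=1.0)
s_s = gaussian_score(mu=0.5, sigma=1.0)
s_sims = sims_score(s_r, s_s, omega=1.0)

# With omega = 1 the combined score is -(x + 0.5): its zero sits at -0.5,
# on the opposite side of the base mean from the auxiliary mean.
print(s_sims(0.0))   # -0.5
print(s_sims(-0.5))  # 0.0
```

Setting `omega=0.0` recovers the base score unchanged, which matches the figure captions' description of standard training as the $\omega=0$ case.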

Abstract

The artificial intelligence (AI) world is running out of real data for training increasingly large generative models, resulting in accelerating pressure to train on synthetic data. Unfortunately, training new generative models with synthetic data from current or past generation models creates an autophagous (self-consuming) loop that degrades the quality and/or diversity of the synthetic data in what has been termed model autophagy disorder (MAD) and model collapse. Current thinking around model autophagy recommends that synthetic data is to be avoided for model training lest the system deteriorate into MADness. In this paper, we take a different tack that treats synthetic data differently from real data. Self-IMproving diffusion models with Synthetic data (SIMS) is a new training concept for diffusion models that uses self-synthesized data to provide negative guidance during the generation process to steer a model's generative process away from the non-ideal synthetic data manifold and towards the real data distribution. We demonstrate that SIMS is capable of self-improvement; it establishes new records based on the Fréchet inception distance (FID) metric for CIFAR-10 and ImageNet-64 generation and achieves competitive results on FFHQ-64 and ImageNet-512. Moreover, SIMS is, to the best of our knowledge, the first prophylactic generative AI algorithm that can be iteratively trained on self-generated synthetic data without going MAD. As a bonus, SIMS can adjust a diffusion model's synthetic data distribution to match any desired in-domain target distribution to help mitigate biases and ensure fairness.

Paper Structure (23 sections, 5 equations, 13 figures, 1 table, 2 algorithms)

Introduction
Background
SIMS: Self Improvement with Synthetic Data
SIMS: Extrapolating to Self-Improvement.
Experimental Results
Self-Improving Diffusion Models
MAD Prevention using SIMS
Two dimensional Gaussian Data in a Synthetic Augmentation Loop
Realistic Data in a Synthetic Augmentation Loop
Experimental setup.
Results.
Distribution Shifts with SIMS
Discussion
Acknowledgement
Ablation Studies for SIMS
...and 8 more sections

Figures (13)

Figure 1: Self-IMproving diffusion models with Synthetic data (SIMS) simultaneously improves diffusion modeling and synthesis performance while acting as a prophylactic against Model Autophagy Disorder (MAD). First row: Samples from a base diffusion model (EDM2-S [kynkaanniemi2024applying]) trained on $1.28$M real images from the ImageNet-512 dataset [karras2024analyzing] (Fréchet inception distance, FID = $2.56$). Second row: Samples from the base model after fine-tuning with $1.5$M images synthesized from the base model, which degrades synthesis performance and pushes the model towards MADness [alemohammad2023arxiv, alemohammad2024selfconsuming] (FID = $6.07$). Third row: Samples from the base model after applying SIMS using the same self-generated synthetic data as in the second row (FID = $1.73$).
Figure 2: SIMS simultaneously improves diffusion modeling and synthesis performance while acting as a prophylactic against MAD. SIMS improves the score function ${\bm{s}}_{\theta_{\rm r}}({\bm{x}}_t, t)$ for a base diffusion model trained on real data by training an auxiliary model on the same real data plus synthetic data from the base model. The score function ${\bm{s}}_{\theta_{\rm s}}({\bm{x}}_t, t)$ of the auxiliary model can be combined with that of the base model to extrapolate a new score function (denoted SIMS) that is closer to the real data distribution.
Figure 3: SIMS consistently self-improves diffusion models. Top row: FID between the SIMS model from Algorithm \ref{alg:sims} and the real data distribution as a function of the guidance parameter $\omega$ at three different checkpoints of the training budget $\mathcal{B}$, measured by the number of million-images-seen (Mi) during fine-tuning of the auxiliary model. Bottom row: FID of the SIMS model as a function of training budget for three different values of the guidance parameter $\omega$.
Figure 4: SIMS simultaneously self-improves and prevents MADness in the synthetic augmentation self-consuming loop. We compare standard synthetic augmentation training [alemohammad2023arxiv, alemohammad2024selfconsuming] to SIMS training in a synthetic augmentation loop across 100 generations for two-dimensional Gaussian data. Standard training corresponds to guidance $\omega=0$ in all cases. At top left, we confirm SIMS's self-improvement by noting that, for a wide range of $\omega$, the expected Wasserstein distance $\mathbb{E}[ \mathrm{dist}(\mathcal{G}^1,p_{\rm r}) ]$ between the first generation model $\mathcal{G}^1 = \mathcal{A}(\mathcal{D}_{\rm r})$ and the real data distribution drops. At the bottom, we confirm that SIMS can act as a prophylactic for MADness. We plot $\frac{\mathbb{E}[ \mathrm{dist}(\mathcal{G}^t,p_{\rm r}) ]}{\mathbb{E}[ \mathrm{dist}(\mathcal{G}^1,p_{\rm r}) ]}$, the ratio of the expected Wasserstein distance at generation $t$ to that at generation 1, for $|\mathcal{D}_{\rm s}^t|=250$ and 125. The green/orange/purple curves correspond to weak MADness mitigation/strong MADness mitigation/MADness prevention. At top right, we plot the normalized expected Wasserstein distance at convergence as a function of $\omega$ for four different synthetic data sizes $|\mathcal{D}_{\rm s}^t|$. A guidance parameter of $\omega\approx 3$ results in either strong MADness mitigation or complete MADness prevention.
Figure 5: SIMS acts as a prophylactic against MADness for realistic training datasets polluted with synthetic data. For the CIFAR-10 (50k real images, left) and FFHQ-64 (70k real images, right) datasets, we plot the FID of the four training scenarios from Section \ref{sec:realistic} as a function of the amount of polluting synthetic data $|\mathcal{D}_{\rm p}|$. While the modeling performance of standard training is strongly affected by increasing amounts of synthetic data pollution (compare $\mathcal{G}^2_{\text{ST-P}}$ to $\mathcal{G}^2_{\text{ST-I}}$), the performance of SIMS training is relatively immune (compare $\mathcal{G}^2_{\text{SIMS-P}}$ to $\mathcal{G}^2_{\text{SIMS-I}}$).
...and 8 more figures
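
The normalized distance plotted in Figure 4 can be sketched as follows. This is an illustration under assumptions the captions do not confirm: the Wasserstein distance is taken to be of order 2, both distributions are treated as Gaussians with diagonal covariance (where the distance has a simple closed form), and the function names are hypothetical.

```python
import math
from typing import Sequence


def w2_diag_gaussians(
    mu_a: Sequence[float],
    std_a: Sequence[float],
    mu_b: Sequence[float],
    std_b: Sequence[float],
) -> float:
    """2-Wasserstein distance between two Gaussians with diagonal covariance.

    For diagonal covariances the squared distance is the squared distance
    between the means plus the squared distance between the per-axis
    standard deviations.
    """
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_a, mu_b))
    std_term = sum((a - b) ** 2 for a, b in zip(std_a, std_b))
    return math.sqrt(mean_term + std_term)


def normalized_distance(dist_t: float, dist_1: float) -> float:
    """Ratio of the distance at generation t to the distance at generation 1."""
    return dist_t / dist_1


# Assumed toy values: real data is N(0, I) in two dimensions.
real_mu, real_std = (0.0, 0.0), (1.0, 1.0)
dist_1 = w2_diag_gaussians((0.1, 0.0), (1.0, 1.0), real_mu, real_std)
dist_t = w2_diag_gaussians((0.2, 0.0), (1.0, 1.0), real_mu, real_std)
print(normalized_distance(dist_t, dist_1))  # ratio above 1: farther from real data than generation 1
```

A ratio above 1 means the generation-$t$ model is farther from the real distribution than the first-generation model; the paper reports the expectation of this distance, which would require averaging over repeated runs.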

Theorems & Definitions (2)

Definition 1
Definition 2
