Table of Contents
Fetching ...

Error Propagation and Model Collapse in Diffusion Models: A Theoretical Study

Nail B. Khelifa, Richard E. Turner, Ramji Venkataramanan

TL;DR

The paper analyzes the drift and potential collapse of score-based diffusion models trained recursively on mixtures of real and synthetic data. It develops a Girsanov-based framework to relate pathwise score errors to intra-generation and accumulated divergences, establishing upper and lower bounds and proving a two-sided equivalence between intra-generation divergence and score-energy in the perturbative regime. It further shows how fresh-data refresh with fraction α induces geometric forgetting of past errors, yielding a discounted decomposition for the accumulated divergence with explicit bias terms, and introduces an adaptive tail condition to manage rare-tail contributions. The results provide theoretical guarantees on when recursive diffusion training remains stable (finite memory) or collapses (unbounded divergence), offering practical guidance for choosing α and monitoring score-error budgets in recursive generation pipelines.

Abstract

Machine learning models are increasingly trained or fine-tuned on synthetic data. Recursively training on such data has been observed to significantly degrade performance in a wide range of tasks, often characterized by a progressive drift away from the target distribution. In this work, we theoretically analyze this phenomenon in the setting of score-based diffusion models. For a realistic pipeline where each training round uses a combination of synthetic data and fresh samples from the target distribution, we obtain upper and lower bounds on the accumulated divergence between the generated and target distributions. This allows us to characterize different regimes of drift, depending on the score estimation error and the proportion of fresh data used in each generation. We also provide empirical results on synthetic data and images to illustrate the theory.

Error Propagation and Model Collapse in Diffusion Models: A Theoretical Study

TL;DR

The paper analyzes the drift and potential collapse of score-based diffusion models trained recursively on mixtures of real and synthetic data. It develops a Girsanov-based framework to relate pathwise score errors to intra-generation and accumulated divergences, establishing upper and lower bounds and proving a two-sided equivalence between intra-generation divergence and score-energy in the perturbative regime. It further shows how fresh-data refresh with fraction α induces geometric forgetting of past errors, yielding a discounted decomposition for the accumulated divergence with explicit bias terms, and introduces an adaptive tail condition to manage rare-tail contributions. The results provide theoretical guarantees on when recursive diffusion training remains stable (finite memory) or collapses (unbounded divergence), offering practical guidance for choosing α and monitoring score-error budgets in recursive generation pipelines.

Abstract

Machine learning models are increasingly trained or fine-tuned on synthetic data. Recursively training on such data has been observed to significantly degrade performance in a wide range of tasks, often characterized by a progressive drift away from the target distribution. In this work, we theoretically analyze this phenomenon in the setting of score-based diffusion models. For a realistic pipeline where each training round uses a combination of synthetic data and fresh samples from the target distribution, we obtain upper and lower bounds on the accumulated divergence between the generated and target distributions. This allows us to characterize different regimes of drift, depending on the score estimation error and the proportion of fresh data used in each generation. We also provide empirical results on synthetic data and images to illustrate the theory.
Paper Structure (59 sections, 17 theorems, 240 equations, 10 figures)

This paper contains 59 sections, 17 theorems, 240 equations, 10 figures.

Key Result

Proposition 3.1

Under ass:finite-drift-girsanov--ass:exponential-is-a-martingale, $\hat{\mathbb P}_i\ll \mathbb P_i^\star$ and, by data processing inequality (marginalization to time $t_0$),

Figures (10)

  • Figure 1: Samples generated by a recursively trained model, projected onto the first two principal components, with $p_\mathrm{data}$ a 10 dimensional Gaussian mixture. Columns correspond to fresh data fractions $\alpha \in \{0.1, 0.5, 0.9\}$ (left to right); rows show generations 0 (start), 10 (middle), and 20 (end). At $\alpha = 0.1$ (left column), the distribution progressively disperses as errors accumulate without sufficient fresh data. At $\alpha = 0.5$ (center), moderate degradation occurs with visible spreading but preserved structure. At $\alpha = 0.9$ (right column), the distribution remains stable throughout, demonstrating that high fresh data fractions effectively prevent collapse. Experiment details in Appendix \ref{['subsec:experimental-setup-10gmm']}.
  • Figure 2: Observability of three classes of score perturbations in CIFAR-10 cifar10 diffusion sampling. We inject controlled perturbations $\mathbf{e}_i(\mathbf{x},t)$ and estimate the observability coefficient $\hat{\eta}_i = \mathrm{Var}(\widehat{\mathbb{E}}[M_i \mid \mathbf{Y}_{t_0}^{i,\star}]) / \mathrm{Var}(M_i)$ using paired reverse trajectories with shared diffusion noise. Aligned: $\mathbf{e}_i(\mathbf{x},t) = \mathbf{w}_i \mathbf{x}$ with $\mathbf{w}_i$ chosen so that the error points in a direction correlated with the drift (i.e., errors push trajectories further along where they are already going). Random: $\mathbf{e}_i(\mathbf{x},t) = \mathbf{w}\mathbf{x}$ with random $\mathbf{w}$ independent of the drift direction Time-only: $\mathbf{e}_i(\mathbf{x},t) = \xi(t)$, independent of state. State-dependent perturbations yield higher $\hat{\eta}_i$, confirming that errors coupled to the sample state are more visible at the terminal distribution.
  • Figure 3: Empirical verification of intra-generational error bounds for a Gaussian mixture.$p_\mathrm{data} \sim \tfrac{1}{5} \sum_{k=1}^{5} \mathcal{N}(\boldsymbol{\mu}_k, \sigma^2 \mathbf{I}_{10})\in \mathbb{R}^{10}$. Left: Upper bound verification (Proposition \ref{['prop:upper-main']}). The KL divergence $\mathrm{KL}(\hat{p}^{i+1} \| q_i)$ between the learned distribution and the training mixture remains bounded by $\frac{1}{2}\hat{\varepsilon}_i^2$. The bound is tight at early generations and becomes conservative as the system equilibrates. Right: Lower bound verification (Proposition \ref{['prop:lower-main']}). The $\chi^2$ divergence $I_i = \chi^2(\hat{p}^{i+1} \| q_i)$ is bounded below by $\frac{1}{8}\hat{\eta}_i \varepsilon_{\star,i}^2$, where $\hat{\eta}_i$ is the estimated observability coefficient. Shaded regions indicate $\pm 1$ standard deviation across runs. Both bounds hold consistently across $\alpha \in \{0.1, 0.5\}$, validating the theoretical framework. Experiment details are given in Appendix \ref{['subsec:experimental-setup-10gmm']}.
  • Figure 4: Two-sided control of intra-generation divergence.$p_\mathrm{data} \sim \tfrac{1}{5} \sum_{k=1}^{5} \mathcal{N}(\boldsymbol{\mu}_k, \sigma^2 \mathbf{I}_{10})\in \mathbb{R}^{10}$. We track the evolution of the divergences $I_i = \chi^2(\hat{p}^{i+1} \| q_i)$ and $I_i^\mathrm{KL} = \mathrm{KL}(\hat{p}^{i+1} \| q_i)$ (solid lines with markers) across 20 generations for low ($\alpha=0.1$, left) and high ($\alpha=0.5$, right) fresh data ratios. Consistent with Theorem \ref{['thm:two-sided-main']}, the intra-generational divergences are effectively lower and upper bounded between by the pathwise score error energy (dashed upper bound $=4\varepsilon_{\star,i}^2$) and the observability-weighted lower bound (dotted line $=\frac{1}{4} \hat{\eta}_i \varepsilon_{\star,i}^2$, where $\hat{\eta}_i$ is the estimated observability of errors, as detailed in Appendix \ref{['subsec:experimental-setup-10gmm']}). Shaded regions indicate $\pm 1$ standard deviation across runs.
  • Figure 5: Geometrically-discounted decomposition of accumulated divergence in \ref{['eq:DN1']}, for $p_\mathrm{data} = \tfrac{1}{5} \sum_{k=1}^{5} \mathcal{N}(\boldsymbol{\mu}_k, \sigma^2 \mathbf{I}_{10})\in \mathbb{R}^{10}$. Each panel shows contributions of $\varepsilon_{\star,i}^2$ to the current global divergence $D_N = \chi^2(\hat{p}^N\,\|\, p_{\text{data}})$, for $i \le N$. Left ($\alpha=0.1$): Errors persist across many generations, creating a wide band of contributions. Center ($\alpha=0.5$): Intermediate regime with moderate memory decay. Right ($\alpha=0.9$): Short memory---only the most recent generation dominates, yielding a sharp diagonal structure. The plots are consistent with Theorem \ref{['thm:divergence-decomposition-finite-horizon-main']}, confirming that higher fractions of fresh data accelerate the forgetting of past errors. Experiment details in Appendix \ref{['subsec:experimental-setup-10gmm']}.
  • ...and 5 more figures

Theorems & Definitions (32)

  • Proposition 3.1: Intra-generational upper bound
  • Definition 3.2: Observability of Errors Coefficient
  • Proposition 3.3: Intra-generational Lower Bound
  • Theorem 3.4: Two-Sided bounds for intra-generational divergence
  • Proposition 4.1: Persistent Errors
  • Theorem 4.2: Discounted accumulation of errors
  • proof : Proof of Proposition \ref{['prop:upper-main']}
  • Proposition 2.1: Marginal Ratio Representation
  • proof
  • Lemma 3.1: Observability of Errors Transfer
  • ...and 22 more