Stabilizing Self-Consuming Diffusion Models with Latent Space Filtering

Zhongteng Cai, Yaxuan Wang, Yang Liu, Xueru Zhang

TL;DR

This work tackles instability in self-consuming diffusion models by analyzing latent representations and showing that latent subspaces degrade across generations, as captured by the Orthogonal Low-rank Embedding (OLE). It proposes Latent Space Filtering (LSF), which trains a probing classifier on real latent embeddings and uses the per-sample confidence \(\xi(\mathbf{x}, y)\) to filter out misaligned synthetic data, with a theoretical connection to subspace orthogonality. Empirically, ACU-LSF improves stability and image fidelity on real-world datasets (e.g., MNIST, CIFAR-10, CelebA) under a fixed training budget, outperforming baselines without requiring extra real data. The paper supports the approach with formal results: a lower bound on OLE (Theorem 1) that links latent-space degeneration to OLE dynamics, and an upper bound on the confidence score (Theorem 2).
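The filtering step described above can be sketched as follows. This is a minimal illustration and not the authors' implementation. It assumes that \(\xi(\mathbf{x}, y)\) is the softmax probability the probing classifier assigns to the sample's own label \(y\), and that the fixed training budget is enforced by keeping the top-scoring samples. Both are assumptions, and the names `confidence_scores`, `filter_by_confidence`, and `budget` are hypothetical.

```python
import numpy as np

def confidence_scores(logits, labels):
    """Softmax probability that the probe assigns to each sample's own label.

    Assumption: this is how xi(x, y) is defined; the summary does not say.
    """
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)
    return probs[np.arange(len(labels)), labels]

def filter_by_confidence(logits, labels, budget):
    """Return the sorted indices of the `budget` most confident samples."""
    xi = confidence_scores(logits, labels)
    order = np.argsort(-xi, kind="stable")  # highest confidence first
    return np.sort(order[:budget])

# Toy probe outputs for four samples from a mixed real/synthetic pool.
logits = np.array([[4.0, 0.0], [0.2, 0.0], [0.0, 3.0], [1.0, 0.9]])
labels = np.array([0, 0, 1, 1])
xi = confidence_scores(logits, labels)
kept = filter_by_confidence(logits, labels, budget=2)
```

In this toy pool, the samples whose probe output agrees strongly with their label are retained, while the two ambiguous samples are dropped.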

Abstract

As synthetic data proliferates across the Internet, it is often reused to train successive generations of generative models. This creates a "self-consuming loop" that can lead to training instability or model collapse. Common strategies to address the issue, such as accumulating historical training data or injecting fresh real data, either increase computational cost or require expensive human annotation. In this paper, we empirically analyze the latent space dynamics of self-consuming diffusion models and observe that the low-dimensional structure of latent representations extracted from synthetic data degrades over generations. Based on this insight, we propose Latent Space Filtering (LSF), a novel approach that mitigates model collapse by filtering out less realistic synthetic data from mixed datasets. Theoretically, we present a framework that connects latent space degradation to empirical observations. Experimentally, we show that LSF consistently outperforms existing baselines across multiple real-world datasets, effectively mitigating model collapse without increasing training cost or relying on human annotation.

Paper Structure

This paper contains 10 sections, 10 theorems, 32 equations, 7 figures, 1 algorithm.

Key Result

Theorem 1

For all $\theta\in [0,\frac{\pi}{2}]$, when $r > 2n$, the expected OLE satisfies a lower bound (the displayed inequality is not reproduced in this summary) involving $\gamma(n) \coloneqq \sqrt{2} \cdot \frac{ \Gamma\left( \frac{n+1}{2} \right) }{ \Gamma\left( \frac{n}{2} \right) }$, where $\Gamma(\cdot)$ is the Gamma function.
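To make the constant $\gamma(n)$ concrete, the short sketch below evaluates it numerically. The function name `gamma_n` is hypothetical. As a side note that the summary itself does not state, this expression is the standard formula for the mean of a chi distribution with $n$ degrees of freedom, i.e. the expected Euclidean norm of a standard Gaussian vector in $\mathbb{R}^n$, so it is always slightly below $\sqrt{n}$.

```python
import math

def gamma_n(n: int) -> float:
    """Evaluate gamma(n) = sqrt(2) * Gamma((n + 1) / 2) / Gamma(n / 2).

    Uses log-Gamma so that large n does not overflow.
    """
    log_ratio = math.lgamma((n + 1) / 2) - math.lgamma(n / 2)
    return math.sqrt(2.0) * math.exp(log_ratio)

# Compare gamma(n) with sqrt(n) for a few dimensions.
for n in (1, 2, 16, 256):
    print(n, gamma_n(n), math.sqrt(n))
```

For $n = 1$ the value is $\sqrt{2/\pi}$ and for $n = 2$ it is $\sqrt{\pi/2}$. The gap to $\sqrt{n}$ shrinks as $n$ grows.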

Figures (7)

  • Figure 1: OLE values of latent representations extracted by a fixed diffusion model across generations and denoising timesteps. Each curve corresponds to a generation within a pure synthetic self-consuming loop. When conditioned on timestep, OLE values increase with generation, indicating progressive structural degradation of the latent space. When conditioned on generation, OLE exhibits a U-shaped trend over denoising timesteps.
  • Figure 2: Correlation between the average generation number of each batch and its corresponding OLE score, computed on the accumulated CelebA dataset. Batches with lower average generation numbers (i.e., containing more realistic images) tend to have lower OLE values. However, the correlation is not strong enough to enable reliable filtering based solely on batch-level OLE.
  • Figure 3: Correlation between the OLE and average confidence scores of each batch, computed on the real and synthetic MNIST datasets produced at different generations. Batches with lower OLE (i.e., greater feature separability) tend to have higher confidence scores, indicating that the confidence score can serve as an individual-level proxy for the batch-level OLE score.
  • Figure 4: The effect of using confidence scores for filtering on the MNIST dataset. (a) Distribution of confidence scores of images sampled at each generation. Higher generations correlate with lower confidence scores, indicating that later-generation images are less aligned with real images; this provides a strong signal for filtering out unrealistic synthetic images. (b) Average generation number within the dataset filtered by the confidence score. (c) The method also preserves more real images in the training set. Using a larger accumulated dataset for filtering increases the ratio of selected real images. Compared with random sampling from the accumulated dataset, the filtering method preserves a distribution closer to that of real images.
  • Figure 5: Performance comparison on MNIST and CelebA across three metrics: FID, precision, and recall. FID measures distributional distance between real and synthetic images (lower is better), while precision and recall assess fidelity and diversity of generated samples, respectively (higher is better). SYN suffers from model collapse. SYN-ADD partially alleviates it via fresh real data. ACU, ACUR, and ACU-LSF maintain stable metrics across generations. ACU-LSF achieves lower FID and higher fidelity than ACUR. ACUR-SIMS exhibits instability, while ACUR-SC suffers from reduced recall and elevated FID.
  • ...and 2 more figures
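The captions above use OLE as a batch-level score, but this summary does not give its formula. The sketch below therefore rests on an assumption: it uses the nuclear-norm form of the original Orthogonal Low-rank Embedding loss (sum of per-class nuclear norms minus the nuclear norm of the whole feature matrix), without any margin term. The paper's exact definition may differ. Under this form the score is non-negative and reaches zero when class subspaces are mutually orthogonal, which matches the reading that lower OLE means greater feature separability. The name `ole_score` is hypothetical.

```python
import numpy as np

def ole_score(features, labels):
    """Assumed OLE form: sum_c ||Z_c||_* - ||Z||_* (nuclear norms).

    features: (num_samples, dim) latent representations of one batch.
    labels:   (num_samples,) integer class labels.
    """
    def nuclear(m):
        return np.linalg.norm(m, ord="nuc")

    per_class = sum(nuclear(features[labels == c]) for c in np.unique(labels))
    return per_class - nuclear(features)

# Two toy batches in R^2: classes on orthogonal axes vs. on the same axis.
rng = np.random.default_rng(0)
a = rng.normal(size=(50, 1))
b = rng.normal(size=(50, 1))
zeros = np.zeros((50, 1))
labels = np.repeat([0, 1], 50)

orthogonal = np.vstack([np.hstack([a, zeros]), np.hstack([zeros, b])])
overlapping = np.vstack([np.hstack([a, zeros]), np.hstack([b, zeros])])

score_orthogonal = ole_score(orthogonal, labels)
score_overlapping = ole_score(overlapping, labels)
```

The batch with orthogonal class subspaces scores essentially zero, while the batch whose classes share one direction scores strictly higher.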

Theorems & Definitions (13)

  • Theorem 1: Lower bound of OLE
  • Remark 1
  • Theorem 2: Upper bound of confidence score
  • Lemma 1
  • Lemma 2
  • Lemma 3
  • Lemma 4
  • Lemma 5
  • Lemma 6
  • proof
  • ...and 3 more