Stabilizing Self-Consuming Diffusion Models with Latent Space Filtering
Zhongteng Cai, Yaxuan Wang, Yang Liu, Xueru Zhang
TL;DR
This work tackles instability in self-consuming diffusion models by analyzing their latent representations and showing that the latent subspaces degrade across generations, as measured by the Orthogonal Low-rank Embedding (OLE). It proposes Latent Space Filtering (LSF), which trains a probing classifier on real latent embeddings and uses the per-sample confidence \(\xi(\mathbf{x}, y)\) to filter out misaligned synthetic data; this confidence is connected theoretically to subspace orthogonality. Empirically, ACU-LSF improves stability and image fidelity on real-world datasets (e.g., MNIST, CIFAR-10, CelebA) under a fixed training budget, outperforming baselines without requiring extra real data. The paper provides formal theorems linking latent-space degeneration to the dynamics of OLE and to the confidence scores.
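The filtering step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the probing classifier has already been trained on real latent embeddings and returns logits for each synthetic sample, it takes \(\xi(\mathbf{x}, y)\) to be the softmax probability assigned to the sample's label, and it keeps a fixed top fraction of samples. The names `confidence`, `filter_synthetic`, and `keep_fraction` are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def confidence(logits, label):
    """Assumed stand-in for xi(x, y): probability the probe assigns to label y."""
    return softmax(logits)[label]


def filter_synthetic(samples, keep_fraction=0.5):
    """Keep the synthetic samples the probe is most confident about.

    samples: list of (sample_id, probe_logits, label) tuples, where
    probe_logits come from a classifier trained on real latents (assumed).
    keep_fraction: hypothetical budget; the source does not state whether
    a threshold or a top-k rule is used.
    """
    scored = [(confidence(logits, y), sid) for sid, logits, y in samples]
    scored.sort(reverse=True)  # highest confidence first
    n_keep = max(1, int(len(scored) * keep_fraction))
    return [sid for _, sid in scored[:n_keep]]


# Toy usage: four synthetic samples, all labeled class 0.
samples = [
    ("a", [4.0, 0.0], 0),
    ("b", [0.0, 0.0], 0),
    ("c", [0.0, 3.0], 0),
    ("d", [1.0, 0.0], 0),
]
kept = filter_synthetic(samples, keep_fraction=0.5)
```

Ranking by confidence and keeping a fixed fraction holds the training set size constant, which matches the fixed training budget mentioned above; a confidence threshold would be an equally plausible reading.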
Abstract
As synthetic data proliferates across the Internet, it is often reused to train successive generations of generative models. This creates a ``self-consuming loop'' that can lead to training instability or \textit{model collapse}. Common strategies to address the issue -- such as accumulating historical training data or injecting fresh real data -- either increase computational cost or require expensive human annotation. In this paper, we empirically analyze the latent space dynamics of self-consuming diffusion models and observe that the low-dimensional structure of latent representations extracted from synthetic data degrades over generations. Based on this insight, we propose \textit{Latent Space Filtering} (LSF), a novel approach that mitigates model collapse by filtering out less realistic synthetic data from mixed datasets. Theoretically, we present a framework that connects latent space degradation to empirical observations. Experimentally, we show that LSF consistently outperforms existing baselines across multiple real-world datasets, effectively mitigating model collapse without increasing training cost or relying on human annotation.
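For readers unfamiliar with the OLE quantity used to track the low-dimensional structure of the latents, one commonly used form is sketched here. This is an assumption based on the general definition of the Orthogonal Low-rank Embedding loss, not necessarily the exact variant used in the paper. Let \(\mathbf{Z}\) be the matrix of latent embeddings for all samples, \(\mathbf{Z}_c\) the submatrix for class \(c\), \(C\) the number of classes, and \(\|\cdot\|_*\) the nuclear norm (sum of singular values):

\[
\mathrm{OLE}(\mathbf{Z}) \;=\; \sum_{c=1}^{C} \|\mathbf{Z}_c\|_* \;-\; \|\mathbf{Z}\|_*
\]

Under this form the value is non-negative, because the nuclear norm of a concatenation never exceeds the sum of the nuclear norms of its blocks, and it reaches zero when the class subspaces are mutually orthogonal. A value that grows across generations would then indicate that the class subspaces are losing their separation.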
