Self-Correcting Self-Consuming Loops for Generative Model Training

Nate Gillman, Michael Freeman, Daksh Aggarwal, Chia-Hong Hsu, Calvin Luo, Yonglong Tian, Chen Sun

TL;DR

The paper tackles training generative models on data that mixes real and machine-generated content, a setting prone to self-consuming loops and collapse. It introduces a distributional self-correction operator $\pi_\gamma$ that blends the current model distribution with the optimal one $p_{\theta^*}$ to stabilize learning, and proves an exponential stability bound under mild regularity assumptions. The authors implement a practical self-correction via physics-based simulation (UHC in MuJoCo) for human motion synthesis and validate the method on toy Gaussian and MNIST tasks, plus a challenging motion dataset, showing improved stability and motion realism even when synthetic data dominates. The results indicate that self-correcting loops can extend safe synthetic data usage across domains, offering scalable, automated stabilization that mitigates collapse and enhances output quality. The work provides theoretical guarantees and empirical evidence that correction strength $\gamma$ can enable larger augmentation $\lambda$, improving convergence rates and robustness in generative training with synthetic data.
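
One concrete way to read the "blend" above is as a convex mixture of the two distributions. The form below is an assumption chosen to match the description (no correction at $\gamma=0$, full correction as $\gamma$ grows); it is not quoted from this page:

$$
\pi_\gamma p_\theta \;=\; \frac{1}{1+\gamma}\,p_\theta \;+\; \frac{\gamma}{1+\gamma}\,p_{\theta^*}, \qquad \gamma \ge 0.
$$

Under this reading, $\gamma=0$ leaves $p_\theta$ unchanged, $\gamma=1$ weights the two distributions equally, and $\gamma\to\infty$ recovers $p_{\theta^*}$.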

Abstract

As synthetic data becomes higher quality and proliferates on the internet, machine learning models are increasingly trained on a mix of human- and machine-generated data. Despite the successful stories of using synthetic data for representation learning, using synthetic data for generative model training creates "self-consuming loops" which may lead to training instability or even collapse, unless certain conditions are met. Our paper aims to stabilize self-consuming generative model training. Our theoretical results demonstrate that by introducing an idealized correction function, which maps a data point to be more likely under the true data distribution, self-consuming loops can be made exponentially more stable. We then propose self-correction functions, which rely on expert knowledge (e.g. the laws of physics programmed in a simulator), and aim to approximate the idealized corrector automatically and at scale. We empirically validate the effectiveness of self-correcting self-consuming loops on the challenging human motion synthesis task, and observe that it successfully avoids model collapse, even when the ratio of synthetic data to real data is as high as 100%.
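
To make the loop concrete, here is a minimal sketch on a 1D Gaussian, not the authors' implementation. It assumes the corrected distribution is a mixture that puts weight $\gamma/(1+\gamma)$ on the true distribution, and it uses oracle samples from the true distribution as the idealized corrector. The names `correction_weight`, `fit_gaussian`, `sample_corrected`, and `self_consuming_loop`, and the constants `MU_STAR` and `SIGMA_STAR`, are hypothetical.

```python
import random
import statistics

# Hypothetical "true" data distribution N(0, 1); an assumption for this toy.
MU_STAR, SIGMA_STAR = 0.0, 1.0


def correction_weight(gamma: float) -> float:
    """Mass placed on the true distribution under the assumed mixture."""
    return gamma / (1.0 + gamma)


def fit_gaussian(data):
    """Maximum-likelihood fit of a 1D Gaussian: returns (mean, std)."""
    return statistics.fmean(data), statistics.pstdev(data)


def sample_corrected(model, gamma, rng):
    """Draw one synthetic point from the gamma-corrected model distribution.

    With probability gamma / (1 + gamma) the point comes from the true
    distribution (an idealized oracle); otherwise from the current model.
    """
    if rng.random() < correction_weight(gamma):
        return rng.gauss(MU_STAR, SIGMA_STAR)
    mean, std = model
    return rng.gauss(mean, std)


def self_consuming_loop(n=200, lam=0.5, gamma=0.0, generations=50, seed=0):
    """Return the fitted (mean, std) pair for generation 0 and each later one."""
    rng = random.Random(seed)
    real = [rng.gauss(MU_STAR, SIGMA_STAR) for _ in range(n)]
    model = fit_gaussian(real)  # generation 0 sees real data only
    history = [model]
    for _ in range(generations):
        n_synth = int(lam * n)  # lam is the synthetic-to-real ratio
        synth = [sample_corrected(model, gamma, rng) for _ in range(n_synth)]
        model = fit_gaussian(real + synth)  # refit on the mixed dataset
        history.append(model)
    return history


if __name__ == "__main__":
    for gamma in (0.0, 1.0, 10.0):
        mean, std = self_consuming_loop(gamma=gamma)[-1]
        print(f"gamma={gamma:>4}: final mean={mean:+.3f}, final std={std:.3f}")
```

Setting `gamma=0.0` gives the uncorrected loop, and `lam=1.0` corresponds to a synthetic-to-real ratio of 100%. Refitting a closed-form Gaussian stands in for fine-tuning a neural generative model, so the sketch illustrates the data flow only, not the paper's quantitative results.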

Paper Structure

This paper contains 31 sections, 9 theorems, 54 equations, 20 figures.

Key Result

Theorem 4.3

Fix an augmentation percentage $\lambda\in\mathbb R_{>0}$ and a correction strength $\gamma\in\mathbb R_{\ge 0}$. Suppose we have an iterative fine-tuning procedure defined by the rule $\theta_{t+1}^n=\pi_\gamma\mathcal{G}_\lambda^n(\theta_t^n)$, suppose that the Assumption (label assumption:body_of_paper) holds, and fix any $\delta\in(0,1)$. If $\theta_0$ is sufficiently close to $\theta^\star$, and if $\lambd

Figures (20)

  • Figure 1: What happens after iteratively training a text-conditioned generative model for human motion synthesis for 50 generations? We simulate a self-consuming loop by creating synthetic data with the latest generative model, and mixing them with the original data to continue training the next generative model. We observe that by self-correcting the synthetic data with a physics simulator, the model can successfully avoid collapse and generate high-quality human motion. Faded poses represent poses from further back in time. Our paper provides theoretical and empirical justification for the self-correcting self-consuming loop.
  • Figure 2: Empirical results from our Gaussian toy example. The graph demonstrates that increasing the correction strength $\gamma$, with a fixed augmentation ratio of $\lambda=0.5$, improves performance and stability after self-consuming iterations.
  • Figure 3: Empirical results from our MNIST toy example. These synthesized images demonstrate that after 50 self-consuming iterations at 150% augmentation percentage, the model which is trained using iterative fine-tuning with self-correction is able to generate higher quality samples than the model trained using iterative fine-tuning without any self-correction.
  • Figure 4: Results from our human motion experiments on iterative fine-tuning with self-correction. These graphs show evaluation metrics for the last checkpoint for every generation. This is the checkpoint used for sampling in the iterative fine-tuning experiments, and it is also the checkpoint where training is resumed with this new partially synthesized dataset. We can see that with self-correction, the iterative fine-tuning procedure more stably converges to a better FID score, and more quickly. When the dataset size is smaller ($n=64$, above) we can see that iterative fine-tuning with no self-correction has a flat Matching score, as well as diverging FID and Diversity scores, indicating model collapse. And when the dataset size is larger ($n=2794$, below), there is less collapse for iterative fine-tuning with no self-correction, although the variance of the FID score is worse, as is the average FID across generations. In both cases, we see that iterative fine-tuning with self-correction outperforms iterative fine-tuning with no self-correction, and is competitive with the baseline after many generations.
  • Figure 5: How does the self-correction operation affect iterative fine-tuning, qualitatively? Here we present some visualizations. The prompt which describes the ground truth motion, and which we use to generate the three other motions, is: "a person stands with feet wide, stretches both hands up over his head and then swings down by the waist and hangs arms down before standing up". We can see that the iterative fine-tuning model produces a motion where the human moves closer to the camera than the others; this is evidence of model collapse, as moving feet is irrelevant to the prompt. Additionally, this motion produces single frames that suddenly snap to a physically impossible position--note the leg penetration through the ground plane. These negative artifacts do not exist in the motions synthesized from the ground truth, baseline model, or iterative fine-tuning with self-correction model. Lastly, we note that the iterative fine-tuning motion depicted here is semantically similar to crawling. We observe in our experiments with smaller dataset sizes that the iterative fine-tuning model generates less diverse outputs than the baseline model and the iterative fine-tuning with self-correction model, and that this crawling pattern appears more often in the latter. Each snapshot is taken at exactly frame 105 of their respective videos. The two motions on the right come from models that were iteratively fine-tuned for 50 generations, with a train set of size $n=64$, and a synthetic augmentation percentage of $25\%$. For all pictures of the human, the camera is fixed at the same position, and for consistency the image is not resized.
  • ...and 15 more figures

Theorems & Definitions (31)

  • Definition 4.1
  • Theorem 4.3: Stability of Iterative Fine-Tuning with Correction
  • Remark 4.4
  • Corollary 4.5
  • proof: Proof of Corollary 4.5
  • Example 4.6
  • Conjecture 4.7
  • Definition 1.1
  • Lemma 1.2
  • proof
  • ...and 21 more