A Closer Look at Model Collapse: From a Generalization-to-Memorization Perspective

Lianghe Shi, Meng Wu, Huijie Zhang, Zekai Zhang, Molei Tao, Qing Qu

TL;DR

This work reveals a practical model-collapse mechanism in diffusion models trained on self-generated data: a transition from generalization to memorization that is tightly linked to the declining entropy of the training set. By quantifying entropy with the Kozachenko-Leonenko estimator and coupling it to a generalization score that measures novelty relative to training data, the authors show that reduced dataset entropy precedes memorization and correlates strongly with degraded generation. They propose entropy-based data-selection strategies, including Greedy Selection and Threshold Decay Filter, to construct high-entropy training subsets and thus slow or prevent the collapse, achieving improved image quality and diversity (lower FID) in recursive generation and CFG settings. The findings offer a practical pathway to robust diffusion-model training in iterative, data-curating scenarios and highlight entropy as a key criterion for maintaining generalization in self-consuming loops.
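To make the entropy measurement concrete, here is a minimal sketch of the Kozachenko-Leonenko k-nearest-neighbor estimator in its standard textbook form. The function names, the brute-force distance search, and the choice of raw coordinate vectors as the feature space are assumptions for illustration; the feature space and neighbor order the authors actually use are not stated in this summary.

```python
import math
import random


def digamma_int(n: int) -> float:
    """Digamma at a positive integer: psi(n) = -gamma + sum_{j=1}^{n-1} 1/j."""
    euler_gamma = 0.5772156649015329
    return -euler_gamma + sum(1.0 / j for j in range(1, n))


def kl_entropy(points, k: int = 1) -> float:
    """Kozachenko-Leonenko estimate of differential entropy, in nats.

    points: sequence of equal-length coordinate tuples (hypothetical features).
    k: which nearest neighbor to use. Brute force O(N^2), so small N only.
    """
    n, d = len(points), len(points[0])
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of points")
    # Log-volume of the d-dimensional Euclidean unit ball.
    log_unit_ball = (d / 2) * math.log(math.pi) - math.lgamma(d / 2 + 1)
    log_dist_sum = 0.0
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        # Distance to the k-th nearest neighbor; clamp so duplicates avoid log(0).
        log_dist_sum += math.log(max(dists[k - 1], 1e-12))
    return digamma_int(n) - digamma_int(k) + log_unit_ball + d * log_dist_sum / n


if __name__ == "__main__":
    rng = random.Random(0)
    wide = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(300)]
    narrow = [(0.1 * x, 0.1 * y) for x, y in wide]  # same cloud, 10x tighter
    print(kl_entropy(wide), kl_entropy(narrow))
```

Because the estimate depends on the data only through nearest-neighbor distances, shrinking every coordinate by a factor of 10 in two dimensions lowers it by exactly 2 ln 10. A dataset that contracts across iterations therefore registers as a falling entropy estimate.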

Abstract

The widespread use of diffusion models has led to an abundance of AI-generated data, raising concerns about model collapse -- a phenomenon in which recursive iterations of training on synthetic data lead to performance degradation. Prior work primarily characterizes this collapse via variance shrinkage or distribution shift, but these perspectives miss practical manifestations of model collapse. This paper identifies a transition from generalization to memorization during model collapse in diffusion models, where models increasingly replicate training data instead of generating novel content during iterative training on synthetic samples. This transition is directly driven by the declining entropy of the synthetic training data produced in each training cycle, which serves as a clear indicator of model degradation. Motivated by this insight, we propose an entropy-based data selection strategy to mitigate the transition from generalization to memorization and alleviate model collapse. Empirical results show that our approach significantly enhances visual quality and diversity in recursive generation, effectively preventing collapse.
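As a rough illustration of entropy-based data selection, the sketch below picks a subset by farthest-point selection, a common heuristic for spreading points apart and thereby enlarging nearest-neighbor distances. This is a swapped-in stand-in, not the authors' algorithm: the function name `greedy_select`, the starting index, and the use of Euclidean distance on raw coordinates are all assumptions, and the paper's own selection procedures may differ.

```python
import math


def greedy_select(points, m: int):
    """Return m indices chosen by farthest-point selection (hypothetical stand-in).

    Starts from index 0, then repeatedly adds the point that is farthest
    from the set selected so far.
    """
    if not 1 <= m <= len(points):
        raise ValueError("m must be between 1 and len(points)")
    selected = [0]
    # min_dist[i] = distance from point i to its closest selected point.
    min_dist = [math.dist(p, points[0]) for p in points]
    min_dist[0] = -1.0  # mark as taken so it is never picked again
    while len(selected) < m:
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(nxt)
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[nxt]))
        min_dist[nxt] = -1.0
    return selected


if __name__ == "__main__":
    pool = [(0.0, 0.0), (0.1, 0.0), (10.0, 0.0), (5.0, 0.0)]
    print(greedy_select(pool, 3))
```

In the toy pool, the near-duplicate at `(0.1, 0.0)` is the last point that would be chosen, which is the behavior one would want from a filter meant to keep a training subset diverse.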

Paper Structure

This paper contains 46 sections, 7 equations, 22 figures, 2 tables, and 2 algorithms.

Figures (22)

  • Figure 1: High-level depiction of the self-consuming pipeline. Top: The collapse iteration represents the replace paradigm, in which models are trained solely on synthetic images generated by the previous diffusion model. Middle: In the mitigated iteration, original real data and previously generated data are added to train the next-generation model. Our proposed selection methods construct a training subset, further mitigating collapse. Bottom Right: Evolution of the generated images.
  • Figure 2: The generalization-to-memorization transition. Left: Visualization of the generated images ($\mathcal{G}_n$) and their nearest neighbors in the training dataset ($\mathcal{D}_n$). As the iterations proceed, the model can only copy images from the training dataset. Right: Generalization score of the models over successive iterations. Different colors represent different dataset sizes. A smaller dataset has a faster decay rate and can fall into the memorization regime from the start [pmlr-v235-zhang24cn]. We use "iteration" to denote a full cycle of training and generation, rather than a gradient update.
  • Figure 3: Decreasing entropy and visualizations. Left: The entropy of the training dataset over iterations. Under the replace paradigm, the training data is the data generated in the previous iteration. Middle and Right: 2-D projection of data points onto the first two singular bases of the real dataset. The orange points represent the generated images at the 1st and 21st iterations, respectively.
  • Figure 4: Scatter plots of the generalization score and properties of the training dataset, i.e., entropy and variance. Each point denotes one iteration of training in the self-consuming loop. We use different colors to represent the results of different dataset sizes.
  • Figure 5: Generalization Score of the trained model over iterations. We indicate the settings on top of the subfigures. In each subfigure, three different lines are used to represent the vanilla paradigm and its variants augmented with the proposed selection methods.
  • ...and 17 more figures
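The Figure 2 caption compares generated images with their nearest neighbors in the training set. A simple proxy for that kind of novelty measure is the mean nearest-neighbor distance from generated samples to training samples, sketched below. The function name `nn_novelty_score` and the Euclidean metric on raw coordinates are assumptions; the paper's exact generalization score is not given in this summary and may be defined differently.

```python
import math


def nn_novelty_score(generated, training) -> float:
    """Mean distance from each generated sample to its nearest training sample.

    A value near zero means the generated samples are close copies of
    training samples; larger values indicate more novel samples.
    """
    if not generated or not training:
        raise ValueError("both sets must be non-empty")
    nearest = [min(math.dist(g, t) for t in training) for g in generated]
    return sum(nearest) / len(nearest)


if __name__ == "__main__":
    training = [(0.0, 0.0), (1.0, 0.0)]
    copies = [(0.0, 0.0), (1.0, 0.0)]   # exact replicas of training points
    novel = [(0.0, 1.0), (1.0, 2.0)]    # points away from the training set
    print(nn_novelty_score(copies, training), nn_novelty_score(novel, training))
```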

Theorems & Definitions (1)

  • Definition 3.1: Differential Entropy [cover1999elements]
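
For reference, the standard textbook definition of differential entropy for a continuous random variable $X$ with density $p$ is given below, together with the usual form of the Kozachenko-Leonenko estimator built on it. The symbols $N$, $d$, $k$, $\varepsilon_i$, and $V_d$ are notation chosen here and may not match the paper's own.

$$
H(X) = -\int p(x) \log p(x)\, dx,
\qquad
\hat{H}_k = \psi(N) - \psi(k) + \log V_d + \frac{d}{N} \sum_{i=1}^{N} \log \varepsilon_i
$$

Here $N$ is the number of samples, $d$ the dimension, $\varepsilon_i$ the Euclidean distance from sample $i$ to its $k$-th nearest neighbor, $V_d = \pi^{d/2} / \Gamma(d/2 + 1)$ the volume of the unit ball, and $\psi$ the digamma function.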