When Models Don't Collapse: On the Consistency of Iterative MLE

Daniel Barzilai, Ohad Shamir

TL;DR

This work analyzes iterative maximum likelihood estimation (MLE) in a setting where synthetic data accumulates alongside the original data. It establishes non-asymptotic bounds showing that model collapse is avoided even as the fraction of real data vanishes, provided the per-iteration sample size grows polylogarithmically with the number of iterations. The setting consists of data drawn from a ground-truth distribution, iterative re-estimation on the accumulated data, and standard regularity assumptions, including a positive definite Fisher information in a neighborhood. The paper also proves negative results: without structural assumptions beyond MLE consistency, model collapse can occur arbitrarily quickly, even after a single iteration, so some smoothness-like conditions are necessary for stability. Together, these results delineate when iterative data accumulation is safe and when it is prone to rapid degradation.

Abstract

The widespread use of generative models has created a feedback loop, in which each generation of models is trained on data partially produced by its predecessors. This process has raised concerns about model collapse: A critical degradation in performance caused by repeated training on synthetic data. However, different analyses in the literature have reached different conclusions as to the severity of model collapse. As such, it remains unclear how concerning this phenomenon is, and under which assumptions it can be avoided. To address this, we theoretically study model collapse for maximum likelihood estimation (MLE), in a natural setting where synthetic data is gradually added to the original data set. Under standard assumptions (similar to those long used for proving asymptotic consistency and normality of MLE), we establish non-asymptotic bounds showing that collapse can be avoided even as the fraction of real data vanishes. On the other hand, we prove that some assumptions (beyond MLE consistency) are indeed necessary: Without them, model collapse can occur arbitrarily quickly, even when the original data is still present in the training set. To the best of our knowledge, these are the first rigorous examples of iterative generative modeling with accumulating data that rapidly leads to model collapse.
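
As a rough illustration of the accumulating-data loop described above, the following Python sketch fits a one-dimensional Gaussian by MLE, samples fresh synthetic points from the fit, appends them to the training set, and refits. The protocol details (an equal batch of `n` points per iteration, the function names `gaussian_mle` and `iterate_mle`, and the default parameters) are assumptions made for illustration and are not taken from the paper.

```python
import math
import random


def gaussian_mle(xs):
    """MLE of (mean, standard deviation) for a 1-D Gaussian."""
    n = len(xs)
    mu = sum(xs) / n
    # The Gaussian MLE of the variance divides by n, not n - 1.
    var = sum((x - mu) ** 2 for x in xs) / n
    return mu, math.sqrt(var)


def iterate_mle(mu0=0.0, sigma0=1.0, n=200, T=20, seed=0):
    """Hypothetical accumulation loop: real data first, then synthetic batches.

    Returns the list of (mu, sigma) estimates, one per iteration, and the
    final size of the accumulated data set.
    """
    rng = random.Random(seed)
    # Iteration 0: n real samples from the ground-truth distribution.
    data = [rng.gauss(mu0, sigma0) for _ in range(n)]
    history = []
    for _ in range(T):
        # Refit on everything accumulated so far (real + synthetic).
        mu, sigma = gaussian_mle(data)
        history.append((mu, sigma))
        # Sample n synthetic points from the current fit and keep them.
        data.extend(rng.gauss(mu, sigma) for _ in range(n))
    return history, len(data)


if __name__ == "__main__":
    history, total = iterate_mle()
    for t, (mu, sigma) in enumerate(history, start=1):
        print(f"fit {t:2d}: mu={mu:+.4f} sigma={sigma:.4f}")
```

In this sketch the real data is never discarded, so at the $t$-th fit the real-data fraction is $1/t$; the printed estimates show how far the fit drifts from $(\mu_0, \sigma_0)$ as that fraction shrinks.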

Paper Structure

This paper contains 23 sections, 33 theorems, 165 equations, 4 figures, 1 algorithm.

Key Result

Theorem 4.1

Under Assumptions (ass: regularity) through (ass: fisher), there exist constants $c,C>0$, depending only on $K_1,K_2,K_3,\lambda_0$ and $r$, such that for any $T\in\mathbb{N}$, any $\delta>0$, and any $n \geq c\left(\log(T)+1\right)^2\log^2\left(\frac{7dT}{\delta}\right)$, it holds with probability at least $1-\ldots$
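
To get a feel for how slowly the sample-size requirement in Theorem 4.1 grows with $T$, the sketch below evaluates the factor $\left(\log(T)+1\right)^2\log^2\left(\frac{7dT}{\delta}\right)$ for a few values of $T$. The constant $c$ is not specified here, so the numbers show growth only, not actual sample sizes. The function name `sample_size_factor` is hypothetical, and the example values of `d` and `delta` are arbitrary choices.

```python
import math


def sample_size_factor(T, d, delta):
    """Evaluate (log(T) + 1)^2 * log^2(7 d T / delta), i.e. the bound without c."""
    if T < 1 or d < 1 or delta <= 0:
        raise ValueError("need T >= 1, d >= 1 and delta > 0")
    return (math.log(T) + 1.0) ** 2 * math.log(7.0 * d * T / delta) ** 2


if __name__ == "__main__":
    # Arbitrary illustrative values for d and delta.
    for T in (1, 10, 100, 1000, 10**6):
        print(T, round(sample_size_factor(T, d=1, delta=0.05), 1))
```

The factor is polylogarithmic in $T$: multiplying $T$ by a constant adds only a constant to each logarithm.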

Figures (4)

  • Figure 1: MLE for a one-dimensional Gaussian distribution.
  • Figure 2: MLE for a one-dimensional Exponential distribution.
  • Figure 3: MLE with respect to a Beta distribution family with PDFs given by $p(x ; \theta) = \theta x^{\theta - 1}$ for $\theta > 0$ and $x\in(0,1)$.
  • Figure 4: MLE with respect to a Beta distribution family with PDFs given by $p(x ; \theta) = \theta x^{\theta - 1}$ for $\theta > 0$ and $x\in(0,1)$, for various choices of real parameter $\theta_0$.
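
For the family $p(x;\theta)=\theta x^{\theta-1}$ used in Figures 3 and 4, the MLE has a closed form: setting the derivative of the log-likelihood $n\log\theta + (\theta-1)\sum_i \log x_i$ to zero gives $\hat\theta = -n/\sum_i \log x_i$. The sketch below samples from the family by inverting its CDF $F(x)=x^{\theta}$ and checks the estimator on simulated data. The function names `sample_power` and `power_mle`, and the chosen parameter values, are assumptions for illustration; this is not the experimental code behind the figures.

```python
import math
import random


def sample_power(theta, n, rng):
    """Draw n samples with density theta * x**(theta - 1) on (0, 1].

    Uses inverse-CDF sampling: F(x) = x**theta, so x = u**(1/theta).
    1 - rng.random() lies in (0, 1], which avoids log(0) later on.
    """
    return [(1.0 - rng.random()) ** (1.0 / theta) for _ in range(n)]


def power_mle(xs):
    """Closed-form MLE: theta_hat = -n / sum(log x_i)."""
    return -len(xs) / sum(math.log(x) for x in xs)


if __name__ == "__main__":
    rng = random.Random(0)
    theta0 = 2.0  # hypothetical ground-truth parameter
    xs = sample_power(theta0, 5000, rng)
    print("estimate:", power_mle(xs))
```

Because both sampling and estimation are closed-form, this family is convenient for simulating many refit iterations cheaply.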

Theorems & Definitions (58)

  • Definition 3.1
  • Theorem 4.1
  • Definition 5.1
  • Theorem 5.1
  • Theorem 5.2
  • Theorem A.1
  • Theorem A.2
  • Corollary A.1
  • Theorem B.1 (jin2019short, Corollary 7)
  • Theorem B.2 (tropp2012user, Theorem 7.1)
  • ...and 48 more