A Note on Shumailov et al. (2024): 'AI Models Collapse When Trained on Recursively Generated Data'

Ali Borji

TL;DR

The effects of fitting a distribution or a model to the data, followed by repeated sampling from it, are investigated.

Abstract

The study conducted by Shumailov et al. (2024) demonstrates that repeatedly training a generative model on synthetic data leads to model collapse. This finding has generated considerable interest and debate, particularly given that current models have nearly exhausted the available data. In this work, we investigate the effects of fitting a distribution (through Kernel Density Estimation, or KDE) or a model to the data, followed by repeated sampling from it. Our objective is to develop a theoretical understanding of the phenomenon observed by Shumailov et al. (2024). Our results indicate that the outcomes reported are a statistical phenomenon and may be unavoidable.
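
The fit-then-resample loop described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's own code (which this page places in the Appendix): the mixture weights, means, widths, sample size, and the helper names `sample_mixture` and `recursive_kde` are assumptions chosen for the example, and SciPy's `gaussian_kde` with its default bandwidth stands in for whatever KDE settings the author used.

```python
import numpy as np
from scipy.stats import gaussian_kde


def sample_mixture(n, rng):
    """Draw n points from an illustrative mixture: two Gaussians plus a Uniform.

    All parameters here are assumptions, not values taken from the paper.
    """
    component = rng.choice(3, size=n, p=[0.4, 0.4, 0.2])
    gauss_a = rng.normal(-3.0, 0.7, size=n)
    gauss_b = rng.normal(3.0, 0.7, size=n)
    uniform = rng.uniform(-1.0, 1.0, size=n)
    return np.where(component == 0, gauss_a,
                    np.where(component == 1, gauss_b, uniform))


def recursive_kde(samples, iterations, rng):
    """Fit a KDE to the current samples, resample from it, and repeat."""
    history = [samples]
    for _ in range(iterations):
        kde = gaussian_kde(samples)  # default (Scott) bandwidth
        seed = int(rng.integers(0, 2**31 - 1))
        # resample returns shape (1, n) for 1-D data; take the first row
        samples = kde.resample(size=samples.shape[0], seed=seed)[0]
        history.append(samples)
    return history


rng = np.random.default_rng(0)
history = recursive_kde(sample_mixture(1000, rng), iterations=30, rng=rng)

# Print the sample standard deviation every 10 generations
for step in range(0, 31, 10):
    print(step, round(float(np.std(history[step])), 3))
```

Each generation sees only samples drawn from the previous generation's estimate, never the original data, which is the recursive setting the abstract refers to. Plotting histograms of `history[0]`, `history[3]`, `history[6]`, and so on would give a view comparable in spirit to the "in steps of 3" panels listed under Figures.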

Paper Structure

This paper contains 5 sections, 4 equations, 6 figures.

Figures (6)

  • Figure 1: AI produces gibberish when trained on too much AI-generated data. Figure from Shumailov et al. (2024), https://www.nature.com/articles/d41586-024-02355-z.
  • Figure 2: A synthetic distribution consisting of two Gaussian components and a Uniform distribution. The original samples and those generated using KDE are displayed. Refer to the Appendix for the code.
  • Figure 3: Recursive KDE and sampling after 30 iterations, in steps of 3, corresponding to Figure 2.
  • Figure 4: Left: original distributions and samples from the KDE after 300 iterations. Right: KL divergence and Wasserstein distance over 300 iterations.
  • Figure 5: Two additional composite distributions (one per column; see text). Top row: original distributions, fitted KDE, and samples generated from the KDE. Middle row: After 30 iterations, samples from the KDE exhibit a single mode and resemble Gaussian distributions. Bottom row: KL divergence and Wasserstein distance over 300 iterations.
  • ...and 1 more figure
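
The captions for Figures 4 and 5 track KL divergence and Wasserstein distance across iterations. One way to estimate both from two sets of 1-D samples is sketched below. The histogram-based KL estimator, the bin count, the smoothing constant `eps`, and the helper name `kl_from_samples` are assumptions made for illustration; this page does not state which estimator the paper uses.

```python
import numpy as np
from scipy.stats import entropy, wasserstein_distance


def kl_from_samples(p_samples, q_samples, bins=50, eps=1e-10):
    """Estimate KL(P || Q) from samples using histograms on a shared grid.

    eps keeps empty bins from producing infinite values; it is an
    assumption of this sketch, not a setting taken from the paper.
    """
    lo = min(p_samples.min(), q_samples.min())
    hi = max(p_samples.max(), q_samples.max())
    p_hist, _ = np.histogram(p_samples, bins=bins, range=(lo, hi))
    q_hist, _ = np.histogram(q_samples, bins=bins, range=(lo, hi))
    # scipy.stats.entropy normalizes both inputs and returns sum(p * log(p / q))
    return float(entropy(p_hist + eps, q_hist + eps))


rng = np.random.default_rng(0)
original = rng.normal(0.0, 1.0, size=2000)   # stand-in for the original samples
generated = rng.normal(0.5, 1.5, size=2000)  # stand-in for samples at a later iteration

print("KL divergence:", kl_from_samples(original, generated))
print("Wasserstein distance:", wasserstein_distance(original, generated))
```

Applied to the original samples and the samples at each generation of a recursive fit-and-resample loop, these two calls would produce per-iteration curves of the kind the captions describe.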