Combining Generative Artificial Intelligence (AI) and the Internet: Heading towards Evolution or Degradation?

Gonzalo Martínez, Lauren Watson, Pedro Reviriego, José Alberto Hernández, Marc Juarez, Rik Sarkar

TL;DR

The paper investigates whether repeatedly training generative AI systems on data that increasingly includes AI-generated content leads to performance degradation. Using a simple diffusion-based setup with the Oxford Flowers dataset, it simulates multiple generations (V1–V4) under varying proportions of AI-generated data (α values) and finds consistent degradation in image quality as AI content accumulates. Even increasing training epochs fails to prevent degeneration, suggesting possible limits to self-reinforcing AI data pipelines. The authors discuss significant limitations and outline future work, emphasizing larger datasets, more complex generative systems, and bias/diversity concerns, which have important implications for the sustainability and fairness of AI-generated data ecosystems.

Abstract

In the span of a few months, generative Artificial Intelligence (AI) tools that can generate realistic images or text have taken the Internet by storm, making them one of the fastest-adopted technologies ever. Some of these generative AI tools, such as DALL-E, MidJourney, or ChatGPT, have gained wide public notoriety. Interestingly, these tools are possible because of the massive amount of data (text and images) available on the Internet: they are trained on massive data sets scraped from Internet sites. Now these generative AI tools are themselves creating massive amounts of new data that are being fed back into the Internet. Therefore, future versions of generative AI tools will be trained on Internet data that is a mix of original and AI-generated data. As time goes on, a mixture of original data and data generated by different versions of AI tools will populate the Internet. This raises a few intriguing questions: how will future versions of generative AI tools behave when trained on a mixture of real and AI-generated data? Will they evolve with the new data sets or degenerate? Will evolution introduce biases in subsequent generations of generative AI tools? In this document, we explore these questions and report some early simulation results using a simple image-generation AI tool. These results suggest that the quality of the generated images degrades as more AI-generated data is used for training, which indicates that generative AI may degenerate. Although these results are preliminary and cannot be generalised without further study, they serve to illustrate the potential issues of the interaction between generative AI and the Internet.

Paper Structure

This paper contains 4 sections and 7 figures.

Figures (7)

  • Figure 1: Simulation model for the evolution of generative AI
  • Figure 2: Examples of real flowers, Original dataset
  • Figure 3: Examples of synthetic flowers, $V_1$ generated set
  • Figure 4: Examples of images generated by the second $V_2$ (top), third $V_3$ (middle), and fourth $V_4$ (bottom) diffusion models trained with the original images and subsequent synthetic versions ($\alpha=1$)
  • Figure 5: Examples of images generated by the second $V_2$ (top), third $V_3$ (middle), and fourth $V_4$ (bottom) diffusion models trained with the original images and subsequent synthetic versions ($\alpha=0.5$)
  • ...and 2 more figures
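
The generational loop summarised in the TL;DR (versions $V_1$ to $V_4$, each trained on original data plus output from earlier versions) can be sketched in a few lines of Python. This is only an illustration of the data flow, not the authors' code. Two things here are assumptions that do not come from the source: a toy Gaussian fit stands in for the diffusion model, and $\alpha$ is read as the number of synthetic items added per earlier generation, relative to the size of the original set. The paper may define $\alpha$ differently. The names `fit_and_sample`, `build_training_set`, and `simulate` are hypothetical.

```python
import random
import statistics


def fit_and_sample(train, n_samples, rng):
    """Toy stand-in for 'train a generative model, then sample from it'.

    Assumption: a Gaussian fitted to scalar values replaces the diffusion model.
    """
    mu = statistics.fmean(train)
    sigma = statistics.pstdev(train)
    return [rng.gauss(mu, sigma) for _ in range(n_samples)]


def build_training_set(original, synthetic_sets, alpha, rng):
    """Assumed mixing rule: keep every original item and add round(alpha * N)
    items sampled from each earlier generation's synthetic output."""
    k = round(alpha * len(original))
    train = list(original)
    for synth in synthetic_sets:
        train.extend(rng.sample(synth, k))
    return train


def simulate(original, alpha, generations=4, seed=0):
    """Run V1..V_generations and return (synthetic sets, training-set sizes)."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("this sketch assumes 0 <= alpha <= 1")
    rng = random.Random(seed)
    n = len(original)
    synthetic_sets = []  # output of V1, V2, ... in order
    train_sizes = []     # size of the training set seen by each version
    for _ in range(generations):
        train = build_training_set(original, synthetic_sets, alpha, rng)
        train_sizes.append(len(train))
        synthetic_sets.append(fit_and_sample(train, n, rng))
    return synthetic_sets, train_sizes
```

With 100 original items and `alpha=0.5`, the first version trains on the original set alone and each later version sees 50 more synthetic items per earlier generation. The sketch does not reproduce the degradation reported in the paper; measuring that would require a real generative model and an image-quality metric.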