Future of AI Models: A Computational perspective on Model collapse

Trivikram Satharasi, S Sitharama Iyengar

TL;DR

The paper asks whether AI-generated content can progressively contaminate training data and precipitate Model Collapse. It tracks year-by-year semantic similarity in an English Wikipedia corpus drawn from filtered Common Crawl, using Transformer embeddings and cosine similarity, and fits an exponential growth model to project saturation timelines. The results show similarity rising even before public LLM adoption and project critical thresholds around the mid-2030s, suggesting risks to data richness and generalization if the trend is left unmitigated. The work highlights the importance of continuous monitoring and data governance for preserving linguistic diversity and the robustness of future AI systems.
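As a rough illustration of the projection step, the sketch below fits an exponential curve to yearly similarity values and solves for the year at which the curve reaches a chosen threshold. The functional form s(t) = a·exp(b·(t − t0)), the log-linear least-squares fit, the helper names `fit_exponential` and `year_reaching`, the threshold, and the data are all assumptions made for illustration; the summary does not state the exact model the authors fit, and the numbers are synthetic, not the paper's measurements.

```python
import math

def fit_exponential(years, sims):
    """Fit s = a * exp(b * (year - years[0])) by least squares on log(s)."""
    t = [y - years[0] for y in years]
    logs = [math.log(s) for s in sims]
    n = len(t)
    t_mean = sum(t) / n
    log_mean = sum(logs) / n
    # Slope and intercept of the ordinary least-squares line through (t, log s).
    b = (sum((ti - t_mean) * (li - log_mean) for ti, li in zip(t, logs))
         / sum((ti - t_mean) ** 2 for ti in t))
    a = math.exp(log_mean - b * t_mean)
    return a, b

def year_reaching(threshold, a, b, start_year):
    """Year at which the fitted curve a * exp(b * t) equals `threshold`."""
    return start_year + math.log(threshold / a) / b

# Synthetic, noise-free data (hypothetical values, not the paper's results).
years = list(range(2013, 2026))
sims = [0.2 * math.exp(0.05 * (y - 2013)) for y in years]

a, b = fit_exponential(years, sims)
projected_year = year_reaching(0.5, a, b, years[0])  # 0.5 is an arbitrary threshold
print(f"a={a:.3f}, b={b:.3f}, threshold reached around {projected_year:.1f}")
```

Fitting in log space keeps the example dependency-free; with noisy real data, a nonlinear least-squares fit on the raw values may weight the years differently and give a different projection.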

Abstract

Artificial Intelligence, especially Large Language Models (LLMs), has transformed domains such as software engineering, journalism, creative writing, academia, and media (Naveed et al. 2025; arXiv:2307.06435). Diffusion models like Stable Diffusion generate high-quality images and videos from text. Evidence shows rapid expansion: 74.2% of newly published webpages now contain AI-generated material (Ryan Law 2025), 30-40% of the active web corpus is synthetic (Spennemann 2025; arXiv:2504.08755), 52% of U.S. adults use LLMs for writing, coding, or research (Staff 2025), and audits find AI involvement in 18% of financial complaints and 24% of press releases (Liang et al. 2025). The underlying neural architectures, including Transformers (Vaswani et al. 2023; arXiv:1706.03762), RNNs, LSTMs, GANs, and diffusion networks, depend on large, diverse, human-authored datasets (Shi & Iyengar 2019). As synthetic content comes to dominate, recursive training risks eroding linguistic and semantic diversity, producing Model Collapse (Shumailov et al. 2024; arXiv:2307.15043; Dohmatob et al. 2024; arXiv:2402.07712). This study quantifies and forecasts collapse onset by examining year-wise semantic similarity in English-language Wikipedia (filtered Common Crawl) from 2013 to 2025 using Transformer embeddings and cosine similarity metrics. Results reveal a steady rise in similarity before public LLM adoption, likely driven by early RNN/LSTM translation and text-normalization pipelines, though the rise is modest because those systems operated at a smaller scale. After the public adoption of LLMs, similarity rises exponentially. Observed fluctuations around this trend reflect irreducible linguistic diversity, variable corpus size across years, and finite sampling error. These findings provide a data-driven estimate of when recursive AI contamination may significantly threaten data richness and model generalization.
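The year-wise similarity measurement described above can be sketched as follows. This is a minimal illustration that assumes the yearly score is the mean cosine similarity over all pairs of document embeddings within a year; the summary does not specify the aggregation, so that choice, the helper names `cosine` and `mean_pairwise_cosine`, and the two-dimensional toy vectors are hypothetical. In the study the vectors would come from a Transformer encoder rather than being written by hand.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Toy stand-ins for document embeddings, grouped by year (hypothetical values).
toy_embeddings_by_year = {
    2013: [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],   # spread-out directions
    2025: [[1.0, 0.1], [1.0, 0.2], [0.9, 0.1]],   # nearly parallel directions
}

yearly_similarity = {
    year: mean_pairwise_cosine(vecs)
    for year, vecs in toy_embeddings_by_year.items()
}
for year, score in sorted(yearly_similarity.items()):
    print(year, round(score, 4))
```

A higher yearly mean indicates that documents from that year point in more similar directions in embedding space, which is the homogeneity signal the study tracks. The all-pairs loop is quadratic in the number of documents, so a real corpus would need sampling or a vectorized computation.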

Paper Structure

This paper contains 15 sections, 6 equations, 2 figures, 1 table.

Figures (2)

  • Figure 1: Schematic representation of the Transformer encoder "BERT", a multi-layer attention-based architecture whose processing can be summarized as a sequence of Multi-Head Attention, Residual Addition and Normalization, and Feed-Forward layers repeated N times. This is also the Transformer used in this work. Image sourced from Castellucci et al. (2019).
  • Figure 2: Average cosine similarity from 2013 to 2025, showing an increasing trend in homogeneity driven by synthetic, AI-generated text. The increase became pronounced after the 2017 Transformer breakthrough in natural language processing and then after the public adoption of Transformer-based LLMs such as the ChatGPT API [gpt3, gpt2] in late 2022.
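The encoder structure summarized in the Figure 1 caption (attention, residual addition and normalization, feed-forward, repeated N times) can be sketched in plain Python. This is a simplified illustration, not BERT itself: it uses a single attention head instead of multi-head attention, ReLU instead of GELU, no biases or learned normalization parameters, and identity weight matrices chosen only so the example runs. All function and parameter names are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(row, eps=1e-5):
    mean = sum(row) / len(row)
    var = sum((x - mean) ** 2 for x in row) / len(row)
    return [(x - mean) / math.sqrt(var + eps) for x in row]

def add_and_norm(x, sublayer_out):
    """Residual addition followed by per-token layer normalization."""
    return [layer_norm([a + b for a, b in zip(xr, sr)])
            for xr, sr in zip(x, sublayer_out)]

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = math.sqrt(len(q[0]))
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
    weights = [softmax(r) for r in scores]
    return matmul(weights, v)

def feed_forward(x, w1, w2):
    hidden = [[max(0.0, h) for h in row] for row in matmul(x, w1)]  # ReLU
    return matmul(hidden, w2)

def encoder_block(x, p):
    x = add_and_norm(x, self_attention(x, p["wq"], p["wk"], p["wv"]))
    return add_and_norm(x, feed_forward(x, p["w1"], p["w2"]))

def encoder(x, layers):
    """Apply the block once per entry in `layers` (the 'repeated N times' step)."""
    for p in layers:
        x = encoder_block(x, p)
    return x

# Toy input: 3 tokens with 4 features each; identity weights are placeholders.
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
params = {"wq": identity, "wk": identity, "wv": identity,
          "w1": identity, "w2": identity}
tokens = [[1.0, 0.0, 2.0, -1.0], [0.5, 1.5, -0.5, 0.0], [2.0, -1.0, 0.0, 1.0]]
out = encoder(tokens, [params] * 2)  # N = 2 blocks
print(len(out), len(out[0]))
```

The output keeps the input shape (one vector per token), which is what allows the block to be stacked N times; in an embedding pipeline these per-token vectors would then be pooled into one vector per document.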