Can Generative Artificial Intelligence Survive Data Contamination? Theoretical Guarantees under Contaminated Recursive Training

Kevin Wang; Hongqian Niu; Didong Li

TL;DR

Contaminated recursive training still converges, at a rate equal to the minimum of the baseline model's convergence rate and the fraction of real data used in each iteration. According to the authors, this is the first (positive) theoretical result on recursive training without distributional assumptions on the data.

Abstract

Generative Artificial Intelligence (AI), such as large language models (LLMs), has become a transformative force across science, industry, and society. As these systems grow in popularity, web data becomes increasingly interwoven with AI-generated material, which is increasingly difficult to separate from naturally generated content. Because generative models are updated regularly, later models will inevitably be trained on mixtures of human-generated data and AI-generated data from earlier versions, creating a recursive training process with data contamination. Existing theoretical work has examined only highly simplified settings in which both the real data and the generative model are discrete or Gaussian, and has shown that such recursive training leads to model collapse. However, real data distributions are far more complex, and modern generative models are far more flexible than Gaussian and linear mechanisms. To fill this gap, we study recursive training in a general framework that makes minimal assumptions on the real data distribution and allows the underlying generative model to be a general universal approximator. In this framework, we show that contaminated recursive training still converges, with a convergence rate equal to the minimum of the baseline model's convergence rate and the fraction of real data used in each iteration. To the best of our knowledge, this is the first (positive) theoretical result on recursive training without distributional assumptions on the data. We further extend the analysis to settings where sampling bias is present in data collection, and we support all theoretical results with empirical studies.
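The recursive process described in the abstract can be illustrated with a minimal sketch. It is not the paper's definition of CRT: it assumes that each round's training set holds a fraction `alpha` of fresh real samples with the remainder drawn from the previous round's generator, and that the "generator" simply resamples its own training set (an ECDF-style estimator). The names `crt_ecdf`, `sample_real`, and `w1_1d` are hypothetical.

```python
import random


def crt_ecdf(sample_real, n, alpha, rounds, seed=0):
    """Hypothetical contaminated recursive training loop.

    Assumption: the generator at each round is the empirical distribution
    of its training set, so sampling from it means resampling with replacement.
    """
    rng = random.Random(seed)
    # Round 0: fit on real data only.
    train = [sample_real(rng) for _ in range(n)]
    history = [train]
    for _ in range(rounds):
        n_real = round(alpha * n)   # fresh real samples this round
        n_synth = n - n_real        # samples from the previous generator
        real = [sample_real(rng) for _ in range(n_real)]
        synth = rng.choices(train, k=n_synth) if n_synth else []
        train = real + synth
        history.append(train)
    return history


def w1_1d(xs, ys):
    """W1 distance between two equal-size 1-D samples (sorted matching)."""
    if len(xs) != len(ys):
        raise ValueError("samples must have equal size")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)
```

With `alpha=0` no real data ever re-enters, so the set of distinct values in the training set can only shrink from round to round; any `alpha > 0` injects fresh real samples at every round.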

Paper Structure (20 sections, 4 theorems, 66 equations, 71 figures, 7 tables)

Introduction
Previous Work
Recursive Training under Data Contamination
Biased Recursive Training under Data Contamination
Simulations
Contaminated Recursive Training (CRT)
KDE with Varying Bandwidth
WGAN Style Generator
Biased Recursive Training (BCRT)
Real Data Experiments
Discussion and Future Work
Code Availability
Proofs
Proof of Theorem 3.4
Proof of Theorem 4.3
...and 5 more sections

Key Result

Theorem 3.4

Suppose Assumptions ass:poly and ass:convex hold. Let $\{\widehat{\mathbb{P}}_t\}_{t\ge0}$ be the sequence of CRT-learned generators. Then [equation not recovered]. Equivalently, up to logarithms, [equation not recovered]. Additionally, if ass:poly is weakened to convergence in probability, the same rates hold in probability.

Figures (71)

Figure 1: Horizontal axis: convergence rate of the baseline generative model; Vertical axis: the fraction of real data. The color indicates which quantity controls the overall convergence rate: red corresponds to the regime in which the rate is limited by the real-data fraction, blue corresponds to the regime in which the rate is limited by the baseline rate, and the diagonal line marks the phase transition between these two regimes.
Figure 2: Density function of a two-Gaussian mixture with parameters $w_1=0.35$, $(\mu_1,\sigma_1)=(-2.0,0.8)$, and $(\mu_2,\sigma_2)=(1.0,1.3)$ used in Simulation of Section 5.1.
Figure 3: CRT Simulation for KDE. Top row uses an ECDF estimator (KDE with bandwidth $=0$); bottom row uses KDE with initial bandwidth $h=0.5$. Left column shows $W_1$ distance, right column MMD distance with fixed bandwidth Gaussian kernel. Blue curves show empirical convergence rates while red curves show theoretical rates.
Figure 4: CRT Simulation for WGAN under $W_1$ loss. Mean distributional distance over 50 replicates, measured by $W_1$ (left) and by MMD with a fixed-bandwidth Gaussian kernel (right), is plotted against $\alpha$, the fraction of real data introduced at each iteration, demonstrating the phase transition. Blue shows the observed mean empirical convergence rate; red shows the theoretical rate.
Figure 12: BCRT Simulation (ECDF Estimator). Final output distributions across combinations of real data fraction $\alpha \in \{0.25,0.5,0.75\}$ and quantile level $q \in \{0.25,0.5,0.75\}$.
...and 66 more figures
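
The two-Gaussian mixture described in the Figure 2 caption can be written down directly. The following is a minimal sketch, assuming the second weight is $1 - w_1 = 0.65$ and that each $\sigma$ is a standard deviation rather than a variance; the function names are hypothetical.

```python
import math
import random

# Parameters from the Figure 2 caption.
W1, MU1, SD1 = 0.35, -2.0, 0.8
# Assumption: the second weight is 1 - w_1, and sigma is a standard deviation.
W2, MU2, SD2 = 1.0 - W1, 1.0, 1.3


def normal_pdf(x, mu, sd):
    """Density of N(mu, sd^2) at x."""
    z = (x - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def mixture_pdf(x):
    """Density of the two-component Gaussian mixture at x."""
    return W1 * normal_pdf(x, MU1, SD1) + W2 * normal_pdf(x, MU2, SD2)


def sample_mixture(rng):
    """Draw one sample: pick a component by weight, then sample from it."""
    if rng.random() < W1:
        return rng.gauss(MU1, SD1)
    return rng.gauss(MU2, SD2)
```

Under these assumptions the mixture mean is $0.35 \cdot (-2.0) + 0.65 \cdot 1.0 = -0.05$, which gives a quick sanity check on drawn samples.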

Theorems & Definitions (11)

Definition 3.1: Generative model
Definition 3.2: Convergence rate
Definition 3.3: Contaminated Recursive Training (CRT)
Theorem 3.4: Convergence rate under CRT
Definition 4.1: Biased contaminated recursive training
Corollary 4.2: Convergence to a biased distribution
Theorem 4.3: Convergence rate under BCRT
proof
Lemma B.1: Cesàro rate for drifting distributions
proof
...and 1 more
