Convergence Dynamics and Stabilization Strategies of Co-Evolving Generative Models

Weiguo Gao, Ming Li

TL;DR

This paper studies co-evolving generative models that shape each other's training through iterative feedback. The setting is common in multimodal AI ecosystems, where text models generate captions that guide image models and the resulting images influence the future adaptation of the text model.

Abstract

The increasing prevalence of synthetic data in training loops has raised concerns about model collapse, where generative models degrade when trained on their own outputs. While prior work focuses on this self-consuming process, we study an underexplored yet prevalent phenomenon: co-evolving generative models that shape each other's training through iterative feedback. This is common in multimodal AI ecosystems, such as social media platforms, where text models generate captions that guide image models, and the resulting images influence the future adaptation of the text model. We take a first step by analyzing such a system, modeling the text model as a multinomial distribution and the image model as a conditional multi-dimensional Gaussian distribution. Our analysis uncovers three key results. First, when one model remains fixed, the other collapses: a frozen image model causes the text model to lose diversity, while a frozen text model leads to an exponential contraction of image diversity, though fidelity remains bounded. Second, in fully interactive systems, mutual reinforcement accelerates collapse, with image contraction amplifying text homogenization and vice versa, leading to a Matthew effect where dominant texts sustain higher image diversity while rarer texts collapse faster. Third, we analyze stabilization strategies implicitly introduced by real-world external influences. Random corpus injections for text models and user-content injections for image models prevent collapse while preserving both diversity and fidelity. Our theoretical findings are further validated through experiments.
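The modeling choice in the abstract (a multinomial text model paired with a conditional Gaussian image model) can be illustrated with a minimal sketch of one feedback round. This is not the paper's algorithm: the update rules below (averaged posterior for the text model, per-text sample mean and a single pooled variance for the image model), the isotropic covariance, and all function names are assumptions made for illustration.

```python
import math
import random


def unit_circle_means(num_texts):
    """Place one 2-D mean per text evenly on the unit circle."""
    return [[math.cos(2 * math.pi * i / num_texts),
             math.sin(2 * math.pi * i / num_texts)] for i in range(num_texts)]


def log_gaussian(y, mean, var):
    """Log-density of an isotropic Gaussian N(mean, var * I) at y."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(y, mean))
    return -0.5 * (len(y) * math.log(2 * math.pi * var) + sq_dist / var)


def posterior(y, p, means, variances):
    """Posterior over texts given image y, computed with log-sum-exp."""
    logs = [math.log(pi) + log_gaussian(y, m, v) if pi > 0 else -math.inf
            for pi, m, v in zip(p, means, variances)]
    top = max(logs)
    weights = [math.exp(value - top) for value in logs]
    total = sum(weights)
    return [w / total for w in weights]


def co_evolve_step(p, means, variances, n_samples, rng):
    """One hypothetical round: sample texts and images, then update both models."""
    num_texts, dim = len(p), len(means[0])
    # The text model emits captions; the image model draws one image per caption.
    texts = rng.choices(range(num_texts), weights=p, k=n_samples)
    images = [[rng.gauss(means[x][j], math.sqrt(variances[x])) for j in range(dim)]
              for x in texts]

    # Assumed text update: average the posterior over all generated images.
    new_p = [0.0] * num_texts
    for y in images:
        for i, w in enumerate(posterior(y, p, means, variances)):
            new_p[i] += w / n_samples

    # Assumed image update: per-text sample mean and pooled isotropic variance.
    new_means, new_vars = [list(m) for m in means], list(variances)
    for i in range(num_texts):
        ys = [y for x, y in zip(texts, images) if x == i]
        if len(ys) < 2:
            continue  # too few samples to re-estimate this component
        mu = [sum(y[j] for y in ys) / len(ys) for j in range(dim)]
        ss = sum((y[j] - mu[j]) ** 2 for y in ys for j in range(dim))
        new_means[i] = mu
        new_vars[i] = max(ss / (dim * (len(ys) - 1)), 1e-12)
    return new_p, new_means, new_vars
```

Calling `co_evolve_step` repeatedly with `rng = random.Random(0)`, a uniform `p` over five texts, and `unit_circle_means(5)` gives a small closed loop in which the text probabilities and image variances can be tracked over time. Freezing one model amounts to discarding the corresponding part of the returned tuple.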

Paper Structure

This paper contains 52 sections, 21 theorems, 200 equations, 11 figures, 3 algorithms.

Key Result

Theorem 3.1

In alg:co-evolving_generative_models, assume that the image model is frozen, meaning its conditional distributions $q({\bm{y}} | x_i)$ remain fixed (hence, we omit the time subscript $t$ in $q_t({\bm{y}}|x_i)$), and that the text model is updated. Then, in expectation, the diversity measure $H_t$ gi where is the posterior probability, $N$ is the number of generated images in each update, and $r_t

Figures (11)

  • Figure 1: An overview of the experimental setup used in the experiments section. The text model is a discrete distribution over $K=5$ text components, while the image model is a $2$-dimensional Gaussian mixture with $K=5$ components whose means are uniformly distributed on the unit circle. The gray arrows indicate how generated texts are used to train the image model (via sample mean and covariance), how generated images are used to train the text model (via posterior probability), and how external information (e.g., corpus or user-content injection) can be introduced.
  • Figure 2: Evolution of text model diversity when the image model is frozen. In this experiment, the image model is initialized with $d=2$ by uniformly distributing its mean vectors on a unit circle, and its covariance matrices are set to $\sigma^2{\bm{I}}$ with $\sigma^2$ taking the values $0.01$, $0.1$, $0.5$, $1$, and $10$. The text model is initialized with a uniform distribution over $K=5$ texts and updated at each macro time step using a batch of $N=1000$ samples; results are averaged over $100$ independent runs. Different line colors represent different covariance scales $\sigma^2$ (with lighter colors corresponding to larger values). The black dashed line indicates the theoretical lower bound on the convergence rate of text model diversity derived in Corollary 3.2, i.e., $1-N^{-1}=0.999$. The results indicate that as $\sigma^2$ increases, text model diversity decreases more slowly. Conversely, as $\sigma^2$ decreases, diversity decays faster, but the rate is approximately bounded below by the theoretical lower bound. These observations are consistent with the theoretical predictions of Theorems 3.1 and 4.1.
  • Figure 3: Evolution of image model diversity when the text model is frozen. Here, the text model is initialized with ${\bm{p}}=(0.06,0.13,0.2,0.27,0.34)$ (for $K=5$ texts), and the image model is initialized as in the frozen-image-model experiment (Figure 2) with $s=1$. The image model is updated using $N=1000$ samples per macro time step, and results are averaged over $100$ runs. Different line colors represent different text probabilities $p_i$ (with lighter colors corresponding to larger values). The colored dashed lines indicate the convergence rate predicted by Theorem 5.1 (i.e., $\rho(x_i) \approx 1-(d+1)/(8(N+1)p_i)$). The observed exponential decay in diversity is consistent with the theoretical prediction, especially when $t$ is relatively small and less affected by numerical round-off errors; deviations tend to emerge at later times, likely due to the accumulation of numerical inaccuracies.
  • Figure 4: Evolution of text model diversity under varying frequencies of image model updates. In this experiment, the image model is initialized as in the frozen-image-model experiment (Figure 2) with $s=1$, and the text model is updated using $N=1000$ samples per macro time step, with results averaged over $100$ runs. Different line colors correspond to different numbers $N_t$ of image updates performed between successive text model updates (with $N_t=0$ indicating a frozen image model and lighter colors corresponding to smaller values). The results show that, compared with the frozen image model, more frequent image updates accelerate the collapse of the text model, in agreement with the theoretical prediction of the theorem on accelerated text model collapse.
  • Figure 5: Evolution of text model diversity under corpus injection. The plot displays the average diversity over macro time steps (averaged over $100$ runs) for different injection fractions $\varepsilon$ (with fixed injection probability $\alpha=0.05$). Each curve corresponds to a different value of $\varepsilon$, with colors becoming lighter as $\varepsilon$ increases. The results illustrate that, compared with the closed system (i.e., $\varepsilon=0$), even a small fraction of injected probability prevents text model diversity from collapsing entirely, in line with the theoretical guarantees provided in Theorem 6.1.
  • ...and 6 more figures
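
The rate $1-N^{-1}$ quoted in the Figure 2 caption matches a classical resampling identity, worked out below under two assumptions that this summary does not confirm: that diversity is measured by the Gini-Simpson index $H = 1 - \sum_i p_i^2$, and that in the small-covariance limit the posterior is effectively one-hot, so the text update reduces to the empirical frequencies $\hat p_i$ of $N$ draws from ${\bm{p}}$.

```latex
\begin{align*}
\hat p_i &= \frac{n_i}{N}, \qquad (n_1,\dots,n_K) \sim \mathrm{Multinomial}(N, {\bm{p}}), \\
\mathbb{E}\big[\hat p_i^2\big] &= \mathrm{Var}(\hat p_i) + \big(\mathbb{E}[\hat p_i]\big)^2
  = \frac{p_i(1-p_i)}{N} + p_i^2, \\
\mathbb{E}\Big[1 - \sum_{i=1}^K \hat p_i^2\Big]
  &= \Big(1 - \sum_{i=1}^K p_i^2\Big) - \frac{1}{N}\Big(1 - \sum_{i=1}^K p_i^2\Big)
  = \Big(1 - \frac{1}{N}\Big)\Big(1 - \sum_{i=1}^K p_i^2\Big).
\end{align*}
```

The last line uses $\sum_i p_i(1-p_i) = 1 - \sum_i p_i^2$. With $N=1000$, the expected diversity shrinks by a factor of $0.999$ per update under these assumptions, which agrees with the value stated in the caption.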

Theorems & Definitions (21)

  • Theorem 3.1: Recursion of text model diversity under frozen image model
  • Corollary 3.2: Text model diversity decays at most exponentially
  • Theorem 3.3: Image model diversity decays under frozen text model
  • Theorem 3.4: Boundedness of image model fidelity under the frozen text model
  • Theorem 4.1: Arbitrarily slow convergence under large covariances
  • Theorem 4.2: Exponential convergence under the trainable image model
  • Theorem 5.1: Differential convergence rate of the image model
  • Theorem 5.2: Matthew effect of image model diversity under text model collapse
  • Theorem 6.1: Stabilization of text model diversity under text injection
  • Theorem 6.2: Stabilization of image model diversity under image injection
  • ...and 11 more
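
The stabilization idea behind Theorem 6.1 and Figure 5 can be illustrated with a minimal sketch. It assumes a simple mixture rule in which a fraction `eps` of probability mass is replaced by a fixed external corpus distribution, and it measures diversity with the Gini-Simpson index. Both choices, and the names `inject_corpus` and `diversity_floor`, are hypothetical; the paper's actual injection scheme (which also involves an injection probability $\alpha$) is not shown in this summary.

```python
def gini_simpson(p):
    """Gini-Simpson diversity 1 - sum_i p_i^2 (an assumed diversity measure)."""
    return 1.0 - sum(x * x for x in p)


def inject_corpus(p, corpus, eps):
    """Mix the text distribution p with an external corpus distribution.

    Hypothetical rule: p' = (1 - eps) * p + eps * corpus.
    """
    if not 0.0 <= eps <= 1.0:
        raise ValueError("eps must lie in [0, 1]")
    if len(p) != len(corpus):
        raise ValueError("p and corpus must have the same length")
    return [(1.0 - eps) * a + eps * b for a, b in zip(p, corpus)]


def diversity_floor(corpus, eps):
    """Lower bound on gini_simpson(inject_corpus(p, corpus, eps)) for any p.

    Every mixed entry is at most 1 - eps * (1 - max(corpus)), and
    sum_i q_i^2 <= max_i q_i for any distribution q, so the diversity
    is at least eps * (1 - max(corpus)).
    """
    return eps * (1.0 - max(corpus))


# Usage: a fully collapsed text model regains diversity after one injection.
collapsed = [1.0, 0.0, 0.0, 0.0, 0.0]
uniform_corpus = [0.2] * 5
mixed = inject_corpus(collapsed, uniform_corpus, eps=0.1)
print(gini_simpson(collapsed), gini_simpson(mixed), diversity_floor(uniform_corpus, 0.1))
```

Under this mixture rule the floor holds regardless of how concentrated `p` is, which is one simple way to see why even a small injected fraction keeps diversity away from zero.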