Golden Ratio Weighting Prevents Model Collapse

Hengzhi He, Shirong Xu, Guang Cheng

TL;DR

This work investigates model collapse during recursive training when models are trained on mixtures of real and synthetic data. It introduces a novel data augmentation framework that weights newly collected real data against synthetic data from the previous training step, and proves a unified expression for the limiting estimation error across Gaussian, GLM, and nonparametric settings. The key finding is a universal optimal weight $w^\star = \frac{\sqrt{k^2+4k} - k}{2}$, which becomes the reciprocal of the golden ratio when $k=1$; naive unweighted mixing is shown to be suboptimal. Empirical validation on simulations and the real-world Adult dataset confirms the theory and provides practical guidelines for mitigating model collapse via weighted data integration.
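
As a quick numerical check of the formula above, the following Python sketch evaluates $w^\star$ for a few values of $k$ and confirms that $k=1$ gives the reciprocal of the golden ratio. The function name `optimal_weight` and the sample values of $k$ are illustrative choices, not taken from the paper. The sketch also uses the algebraic fact that $w^\star$ is the positive root of $w^2 + kw - k = 0$, which follows directly from the closed form.

```python
import math


def optimal_weight(k: float) -> float:
    """Evaluate w* = (sqrt(k^2 + 4k) - k) / 2 for k > 0 (formula from the TL;DR)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return (math.sqrt(k * k + 4.0 * k) - k) / 2.0


GOLDEN_RATIO = (1.0 + math.sqrt(5.0)) / 2.0

if __name__ == "__main__":
    # The k values below are arbitrary illustrations, not values from the paper.
    for k in (0.25, 0.5, 1.0, 2.0, 4.0):
        w = optimal_weight(k)
        # w* is the positive root of w^2 + k*w - k = 0, so this should be ~0.
        residual = w * w + k * w - k
        print(f"k={k:<5} w*={w:.6f} residual={residual:.1e}")
    print(f"1/golden ratio = {1.0 / GOLDEN_RATIO:.6f}")
```

Since $\sqrt{k^2+4k} < k+2$, the weight always lies strictly between 0 and 1 for $k > 0$.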

Abstract

Recent studies identified an intriguing phenomenon in recursive generative model training known as model collapse, where models trained on data generated by previous models exhibit severe performance degradation. Addressing this issue and developing more effective training strategies have become central challenges in generative model research. In this paper, we investigate this phenomenon within a novel framework, where generative models are iteratively trained on a combination of newly collected real data and synthetic data from the previous training step. To develop an optimal training strategy for integrating real and synthetic data, we evaluate the performance of a weighted training scheme in various scenarios, including Gaussian distribution estimation, generalized linear models, and nonparametric estimation. We theoretically characterize the impact of the mixing proportion and weighting scheme of synthetic data on the final model's performance. Our key finding is that, across different settings, the optimal weighting scheme under different proportions of synthetic data asymptotically follows a unified expression, revealing a fundamental trade-off between leveraging synthetic data and model performance. In some cases, the optimal weight assigned to real data corresponds to the reciprocal of the golden ratio. Finally, we validate our theoretical results on extensive simulated datasets and a real tabular dataset.
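
To make the recursive setup concrete, here is a minimal Python sketch of a toy version of the Gaussian case. It rests on assumptions that do not come from the paper: the model estimates the mean of a unit-variance Gaussian, each step combines the mean of `n` fresh real samples (weight `w`) with the mean of `m` synthetic samples drawn from the previous model (weight `1 - w`), and $k$ is read as `n / m`. This excerpt does not define $k$, so that reading is a guess. Under these assumptions the variance of the estimate follows a simple recursion, and a grid search over `w` can be compared against the closed-form weight quoted in the TL;DR.

```python
import math


def optimal_weight(k: float) -> float:
    """Closed form quoted in the TL;DR: w* = (sqrt(k^2 + 4k) - k) / 2."""
    return (math.sqrt(k * k + 4.0 * k) - k) / 2.0


def step_variance(v_prev: float, w: float, n: int, m: int) -> float:
    """One recursion step of the toy model (an assumption, not the paper's estimator).

    New estimate = w * mean(n real draws) + (1 - w) * mean(m synthetic draws),
    where the synthetic draws are centred on the previous estimate, whose
    variance is v_prev. All draws have unit variance.
    """
    return w * w / n + (1.0 - w) ** 2 * (v_prev + 1.0 / m)


def iterate_variance(w: float, n: int, m: int, steps: int = 200) -> float:
    """Run the variance recursion, starting from a model fitted on n real samples."""
    v = 1.0 / n
    for _ in range(steps):
        v = step_variance(v, w, n, m)
    return v


def limiting_variance(w: float, n: int, m: int) -> float:
    """Fixed point of step_variance, valid for 0 < w <= 1."""
    return (w * w / n + (1.0 - w) ** 2 / m) / (1.0 - (1.0 - w) ** 2)


def grid_argmin(n: int, m: int, grid_size: int = 10_000) -> float:
    """Weight in (0, 1] that minimises the limiting variance on a uniform grid."""
    candidates = (i / grid_size for i in range(1, grid_size + 1))
    return min(candidates, key=lambda w: limiting_variance(w, n, m))


if __name__ == "__main__":
    n, m = 100, 100          # hypothetical sample sizes, so k = n / m = 1
    w_star = optimal_weight(n / m)
    w_naive = n / (n + m)    # unweighted pooling of real and synthetic samples
    print("grid minimiser:", grid_argmin(n, m), "closed form:", w_star)
    for name, w in (("real only", 1.0), ("naive pooling", w_naive), ("w*", w_star)):
        print(f"{name:14s} limiting variance = {limiting_variance(w, n, m):.6f}")
```

In this toy model the grid minimiser agrees with the closed form, and the weighted scheme gives a smaller limiting variance than either pooling all samples equally or discarding the synthetic data. The paper's actual estimators and assumptions may differ from this sketch.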

Paper Structure

This paper contains 33 sections, 269 equations, 11 figures, 3 algorithms.

Figures (11)

  • Figure 1: The general framework of the model collapse phenomenon during recursive training [shumailov2024ai]. In this framework, $\mathcal{D}_0$ denotes the initial real dataset and $\widetilde{\mathcal{D}}_{t}$ denotes the synthetic dataset generated by the $(t-1)$-th generative model, which is then used to train the $t$-th generative model.
  • Figure 2: The general framework for mitigating the model collapse phenomenon during recursive training by accumulating data [gerstgrasser2024model, kazdan2024collapse]. In this framework, the training dataset is progressively expanded at each step $t$ by incorporating synthetic data from the previous generative model.
  • Figure 3: The general framework for avoiding the model collapse phenomenon during recursive training by augmenting only real data [bertrand2024on]. In this framework, at the $t$-th training step, the generative model is trained on $\widetilde{\mathcal{D}}_{t} \cup \mathcal{D}_0$.
  • Figure 4: The general framework considered in this paper involves mixing newly collected real and synthetic data to address model collapse. In this framework, at the $t$-th training step, a newly collected dataset $\mathcal{D}_t$ is augmented with $\widetilde{\mathcal{D}}_{t}$ to train the $t$-th generative model.
  • Figure 5: The general framework of recursive model estimation.
  • ...and 6 more figures