Preventing Model Collapse Under Overparametrization: Optimal Mixing Ratios for Interpolation Learning and Ridge Regression

Anvit Garg, Sohom Bhattacharya, Pragya Sur

TL;DR

This work addresses model collapse in overparametrized learning when training on synthetic data by introducing a fresh data augmentation scheme that blends real labels with synthetic labels generated from the model itself. It derives exact asymptotic generalization errors for both the min-$\ell_2$-norm interpolator and ridge regression under iterative data mixing, and identifies optimal mixing ratios: $w^* = 1/\varphi$ for the interpolator and $w^* \in [0.5,1]$ for ridge across several covariance structures. The results reveal how spectral geometry and problem parameters (covariance, SNR, regularization) govern optimal weighting, including dynamics where the adaptive mixing sequence converges to the same long-run risk as a fixed $w^*$. The findings provide principled guidance for mitigating model collapse in high-dimensional settings and are validated by extensive simulations, including random-effects and spiked covariance models. Overall, the paper extends the understanding of interpolation learning under augmentation and offers concrete, theory-backed strategies for safe synthetic-data use in overparametrized regimes.

Abstract

Model collapse occurs when generative models degrade after repeatedly training on their own synthetic outputs. We study this effect in overparameterized linear regression in a setting where each iteration mixes fresh real labels with synthetic labels drawn from the model fitted in the previous iteration. We derive precise generalization error formulae for minimum-$\ell_2$-norm interpolation and ridge regression under this iterative scheme. Our analysis reveals intriguing properties of the optimal mixing weight that minimizes long-term prediction error and provably prevents model collapse. For instance, in the case of min-$\ell_2$-norm interpolation, we establish that the optimal real-data proportion converges to the reciprocal of the golden ratio for fairly general classes of covariate distributions. Previously, this property was known only for ordinary least squares, and only in low dimensions. For ridge regression, we further analyze two popular model classes -- the random-effects model and the spiked covariance model -- demonstrating how spectral geometry governs optimal weighting. In both cases, as well as for isotropic features, we uncover that the optimal mixing ratio should be at least one-half, reflecting the necessity of favoring real data over synthetic data. We study three additional settings: (i) where real data is fixed and fresh labels are not obtained at each iteration, (ii) where covariates vary across iterations but fresh real labels are available each time, and (iii) where covariates vary with time but only a fraction of them receive fresh real labels at each iteration. Across these diverse settings, we characterize when model collapse is inevitable and when synthetic data improves learning. We validate our theoretical results with extensive simulations.
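To see why a fixed mixing weight can keep the error bounded, the following minimal sketch tracks only how label noise accumulates across iterations. It rests on assumptions that the abstract does not spell out: iteration $t$ fits a linear estimator to the labels $w \cdot y_{\text{real}} + (1-w) \cdot y_{\text{synthetic}}$ on a fixed design, both label types carry independent noise of equal variance, and the iteration-0 fit uses real labels only. Under those assumptions the noise-variance multiplier obeys $v_t = w^2 + (1-w)^2 + (1-w)^2 v_{t-1}$. The function names are hypothetical and chosen for illustration.

```python
def variance_multiplier(w, iters, v0=1.0):
    """Iterate v_t = w^2 + (1-w)^2 + (1-w)^2 * v_{t-1}.

    ASSUMPTION: labels are mixed as w*real + (1-w)*synthetic, with
    independent noise of equal variance in both; v0 = 1 corresponds to
    an initial fit on real labels only.
    """
    history = [v0]
    for _ in range(iters):
        fresh_noise = w**2 + (1 - w) ** 2          # real + synthetic noise drawn this round
        inherited = (1 - w) ** 2 * history[-1]     # noise carried over from the previous fit
        history.append(fresh_noise + inherited)
    return history


def fixed_point(w):
    """Closed-form limit of the recursion, valid for 0 < w <= 1."""
    return (w**2 + (1 - w) ** 2) / (1 - (1 - w) ** 2)


if __name__ == "__main__":
    for w in (0.1, 0.5, 0.618, 1.0):
        v = variance_multiplier(w, iters=200)[-1]
        print(f"w={w:5.3f}  v_200={v:.4f}  limit={fixed_point(w):.4f}")
    # With w = 0 (synthetic labels only) the multiplier grows by 1 each round.
    print(variance_multiplier(0.0, iters=5))
```

Since $1-(1-w)^2 = w(2-w)$, the closed-form limit coincides with the factor $c(w)$ that appears in Theorem 3.1. For any $w \in (0,1]$ the multiplier converges geometrically at rate $(1-w)^2$, whereas for $w = 0$ it grows without bound in this sketch.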

Paper Structure

This paper contains 25 sections, 19 theorems, 104 equations, 5 figures, 4 algorithms.

Key Result

Theorem 3.1

In the setting of Section \ref{sec:formulate}, the risk of $\widehat{\pmb{\beta}}_t$, defined by equation (\ref{eq:define_interpolator}), satisfies the following. For any $w \in (0,1)$, we have, almost surely over the randomness in the covariates, [display equation not preserved] with $c(w) = (w^2+(1-w)^2)/(w(2-w))$. Moreover, the limiting risk is minimized at $w^\star = \varphi^{-1}$, where $\varphi = (1+\sqrt{5})/2$ is the golden ratio.
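The stated minimizer can be sanity-checked directly on $c(w)$, reading its denominator as $w(2-w)$. Setting $c'(w) = 0$ reduces to $w^2 + w - 1 = 0$, whose only root in $(0,1)$ is $(\sqrt{5}-1)/2 = \varphi^{-1}$. The sketch below confirms this numerically with a golden-section search. It examines only $c(w)$, not the full limiting-risk expression, and the helper name `golden_section_minimize` is hypothetical.

```python
import math


def c(w):
    """Factor from Theorem 3.1: (w^2 + (1-w)^2) / (w * (2 - w))."""
    return (w**2 + (1 - w) ** 2) / (w * (2 - w))


def golden_section_minimize(f, lo, hi, tol=1e-10):
    """Minimize a unimodal function f on [lo, hi] by golden-section search."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        x1 = b - inv_phi * (b - a)   # left interior probe
        x2 = a + inv_phi * (b - a)   # right interior probe
        if f(x1) < f(x2):
            b = x2                   # minimum lies in [a, x2]
        else:
            a = x1                   # minimum lies in [x1, b]
    return (a + b) / 2


if __name__ == "__main__":
    w_star = golden_section_minimize(c, 1e-6, 1.0)
    print(f"numerical minimizer : {w_star:.6f}")
    print(f"1 / golden ratio    : {(math.sqrt(5) - 1) / 2:.6f}")
    print(f"c at the minimizer  : {c(w_star):.6f}")
```

At the minimizer $w^2 = 1 - w$, so $c(w^\star) = (1-w^\star)/w^\star = w^\star$: the minimum value of $c$ is itself $\varphi^{-1} \approx 0.618$.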

Figures (5)

  • Figure 1: Generalization error of the min-$\ell_2$-norm interpolator as a function of the weight $w$ (panels $(a)$ and $(b)$) and of the iteration $t$ (panel $(c)$) for different values of $\gamma$. Panel $(a)$ considers the isotropic covariance ${\boldsymbol\Sigma}= \boldsymbol I$, and panel $(b)$ considers an anisotropic ${\boldsymbol\Sigma}$ with ${\boldsymbol\Sigma}_{ij}= 2^{-|i-j|}$, which is the covariance matrix of an AR$(1)$ model. In panels $(a)$ and $(b)$, the risk is minimized at $w^\star= 1/\varphi$, as proved in Theorem \ref{thm:interpolator}. Panel $(c)$ shows that both empirical and theoretical risks stabilize within a few iterations.
  • Figure 2: Panels $(a)$ and $(b)$ plot the optimal mixing weight $w^{\star}$ as a function of $\lambda$ for different values of $\gamma$ and two classes of covariance matrices. Panel $(a)$ considers ${\boldsymbol\Sigma}= \boldsymbol I$ with high noise $\sigma^2= 64$, demonstrating that $w^\star$ can be close to $0.5$ at low SNR. Panel $(b)$ plots $w^{\star}$ for the spiked covariance matrix, showing $w^\star =1$ for large $\lambda$. Panel $(c)$ plots the generalization error as a function of $\lambda$ for an equicorrelated ${\boldsymbol\Sigma}$. Here, empirical risks align with the theoretical predictions of Proposition \ref{thm:anisotropic_structured} even though ${\boldsymbol\Sigma}$ violates Assumption \ref{assn:combined}, illustrating the robustness of our results.
  • Figure 3: Panel $(a)$ plots the optimal $w^\star$ for ridge regression as a function of $\gamma$. Panel $(b)$ plots the empirical risk of the min-norm interpolator to demonstrate the dynamic mixing scenario described in Section \ref{sec:interpol}. Note that the risk curves under dynamic mixing and under fixed $w=\varphi^{-1}$ overlap heavily from $t = 2$ onward. Panel $(c)$ plots the empirical (points) and theoretical (solid lines) risks for the case without fresh data augmentation, described by equation (\ref{eq:no_new_define}) and Theorem \ref{thm:no_new_label}. Note that the risk is minimized at $w^\star=1$. Throughout, ${\boldsymbol\Sigma}=\boldsymbol{I}$.
  • Figure 4: Panel $(a)$: We plot theoretical values of the optimal mixing parameter $w^\star$ as a function of $\gamma$, as obtained in Theorem \ref{thm:time_vary}, for $\texttt{SNR}\in\{0.5,1,2\}$. The plots show that $w^\star$ is non-monotone as a function of $\gamma$. Panel $(b)$: We empirically verify the generalization error of Theorem \ref{thm:time_vary}. For $\gamma\in\{1.5,3\}$, we plot the risk as a function of $w$. The dotted lines correspond to the empirical risk and the solid lines to the theoretical risk. The vertical dashed lines mark the optimal mixing value $w^\star$. The empirical risk is computed with $\texttt{SNR}=5$, $n=1500$, and iteration $t=20$.
  • Figure 5: We plot the generalization error of $\widehat{\beta}^{(a)}_t$ to complement the theoretical findings of Theorem \ref{thm:pooled}. The dotted lines correspond to $\log(\text{Risk})$ with $\texttt{SNR}=1$, $n=100$, and different values of $\gamma$. The $x$-axis shows the number of iterations $t$. Because the vertical axis is on a log scale, the plot indicates that the risk increases exponentially with iterations, resulting in model collapse.

Theorems & Definitions (32)

  • Remark 3.1
  • Theorem 3.1: Interpolator Risk
  • Theorem 3.2: Isotropic risk
  • Theorem 3.3: Isotropic Optimal Mixing
  • Theorem 3.4: Ridge risk under correlated covariates
  • Proposition 3.1: Ridge risk under Random-Effects Models
  • Proposition 3.2
  • Remark 3.2
  • Proposition 3.3
  • Theorem 5.1
  • ...and 22 more