Why Self-Training Helps and Hurts: Denoising vs. Signal Forgetting

Mingqi Wu, Archer Y. Yang, Qiang Sun

TL;DR

This work analyzes iterative self-training in overparameterized linear regression. It identifies a tension between two effects of repeated teacher–student transfers: denoising the stochastic error inherited from the initial label noise, and progressively forgetting signal directions. Together these produce a $U$-shaped generalization curve, which motivates early stopping. The paper derives deterministic equivalents for the risk via fixed-point equations and spectral-shrinkage operators and proves that the empirical risk concentrates around these limits, which enables fully data-driven stopping with iterated GCV. It further shows that iteration acts as a spectral filter that preserves strong eigendirections while suppressing weaker ones, a form of soft feature selection distinct from ridge regression. Experiments on synthetic covariance settings and a ResNet-50 model on CIFAR-10 support the theory and illustrate the denoising–forgetting trade-off.

Abstract

Iterative self-training (self-distillation) repeatedly refits a model on pseudo-labels generated by its own predictions. We study this procedure in overparameterized linear regression: an initial estimator is trained on noisy labels, and each subsequent iterate is trained on fresh covariates with noiseless pseudo-labels from the previous model. In the high-dimensional regime, we derive deterministic-equivalent recursions for the prediction risk and effective noise across iterations, and prove that the empirical quantities concentrate sharply around these limits. The recursion separates two competing forces: a systematic component that grows with iteration due to progressive signal forgetting, and a stochastic component that decays due to denoising via repeated data-dependent projections. Their interaction yields a $U$-shaped test-risk curve and an optimal early-stopping time. In spiked covariance models, iteration further acts as an iteration-dependent spectral filter that preserves strong eigendirections while suppressing weaker ones, inducing an implicit form of soft feature selection distinct from ridge regression. Finally, we propose an iterated generalized cross-validation criterion and prove its uniform consistency for estimating the risk along the self-training trajectory, enabling fully data-driven selection of the stopping time and regularization. Experiments on synthetic covariances validate the theory and illustrate the predicted denoising-forgetting trade-off.
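
The procedure described in the abstract can be sketched numerically. The snippet below is a minimal illustration, not the authors' code: it assumes a minimum-norm least-squares fit at every step, a single-spike covariance $\Sigma = I + s\,vv^\top$, and a signal aligned with the spike. The function names, the sample sizes, and the noise level are all hypothetical choices. Because each refit is linear, the estimator splits into a part driven by the true signal and a part driven by the initial label noise, so the systematic and stochastic errors can be tracked separately.

```python
import numpy as np

def sample_spiked(rng, n, v, s):
    """Draw n rows from N(0, I + s * v v^T), where v is a unit vector."""
    Z = rng.standard_normal((n, v.size))
    # Multiplying by Sigma^{1/2} = I + (sqrt(1 + s) - 1) v v^T.
    return Z + (np.sqrt(1.0 + s) - 1.0) * np.outer(Z @ v, v)

def sigma_norm_sq(u, v, s):
    """u^T (I + s v v^T) u, the excess prediction risk of an error vector u."""
    return float(u @ u + s * (v @ u) ** 2)

def self_training_components(n=100, p=200, s=25.0, sigma=2.0, n_iters=8, seed=0):
    rng = np.random.default_rng(seed)
    v = np.zeros(p)
    v[0] = 1.0                 # spike direction
    beta = v.copy()            # assumption: the signal lies along the spike

    # Step 0: minimum-norm fit on noisy labels y = X0 beta + noise.
    X0 = sample_spiked(rng, n, v, s)
    X0_pinv = np.linalg.pinv(X0)
    signal = X0_pinv @ (X0 @ beta)                      # part driven by beta
    noise = X0_pinv @ (sigma * rng.standard_normal(n))  # part driven by label noise

    systematic = [sigma_norm_sq(signal - beta, v, s)]
    stochastic = [sigma_norm_sq(noise, v, s)]
    for _ in range(n_iters):
        # Fresh covariates, noiseless pseudo-labels from the previous iterate.
        X = sample_spiked(rng, n, v, s)
        X_pinv = np.linalg.pinv(X)
        signal = X_pinv @ (X @ signal)
        noise = X_pinv @ (X @ noise)
        systematic.append(sigma_norm_sq(signal - beta, v, s))
        stochastic.append(sigma_norm_sq(noise, v, s))
    return systematic, stochastic

if __name__ == "__main__":
    sys_err, sto_err = self_training_components()
    for t, (b, e) in enumerate(zip(sys_err, sto_err)):
        print(f"t={t}: systematic={b:.3f}  stochastic={e:.3f}")
```

In this sketch the systematic term grows and the stochastic term shrinks as the iteration count increases, which is the qualitative trade-off the abstract describes. The exact location of the best stopping time depends on the assumed parameters.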

Paper Structure (70 sections, 18 theorems, 131 equations, 6 figures, 1 algorithm)

Key Result

Theorem 3.2

Assume Assumptions ass:indpX--ass:noise and Assumption ass:spikedmodel hold. In the limit $p,n\to\infty$ with $p/n\to \rho>1$ and $\tau = \rho-1$, the deterministic prediction risk $\mathcal{R}_{t}^{*} := \lim_{p,n\to\infty}\mathcal{R}_{t}$ decomposes into a systematic error and a stochastic error. For $t\ge 1$, the stochastic error satisfies a recursion with initial condition $\mathcal{V}_{0}^{*} =\frac{\sigma

Figures (6)

  • Figure 1: Simulation results for the spiked covariance model ($s=25$). Solid lines denote theoretical predictions; markers show simulation averages over 10 trials. (a) Iteration substantially reduces generalization error in the overparameterized regime. (b) The $U$-shaped curves match our theoretical predictions: initial iterations reduce stochastic error, while excessive iterations eventually lead to an accumulation of systematic error.
  • Figure 2: Systematic vs. stochastic errors across iterations. (a) The systematic error increases with $t$, reflecting signal forgetting, while (b) the stochastic error decreases due to denoising. Each curve corresponds to a different aspect ratio $\rho$.
  • Figure 3: Performance under the spiked covariance model. (a) Test error over iterations $t$ for varying spike strengths $s$. The $U$-shaped curves reflect early denoising and late-stage signal forgetting. (b) Minimum error over $t$ versus optimally tuned ridge (minimum over $\lambda$). As signal strength $s$ increases, iterative self-training can surpass ridge.
  • Figure 4: Validation of the iterated GCV (iGCV) estimator. Solid lines denote the iGCV estimate; markers denote the empirical test risk. (a) iGCV tracks the risk across aspect ratios in the overparameterized regime. (b) iGCV captures the $U$-shaped risk trajectory over iterations and identifies the optimal early-stopping point.
  • Figure 5: General covariance simulations ($\Sigma_{ii}=1/i$). Solid lines denote theory and markers denote simulations (averaged over 10 trials). (a) Test error versus aspect ratio $\rho=p/n$ for fixed iteration counts $t$. Iteration lowers the error floor and suppresses the double-descent peak. (b) Test error versus iteration $t$ for fixed aspect ratios $\rho$. The $U$-shaped curves indicate that the denoising–forgetting trade-off persists under general feature correlations.
  • ...and 1 more figure
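
The spectral-filter behavior under the spiked covariance model (Figure 3) can be probed with a small experiment. This is a rough sketch under stated assumptions rather than the paper's analysis: it assumes minimum-norm refits, so that one teacher–student transfer on fresh covariates $X$ applies the projection $P = X^{+}X$ onto the row space of $X$. The helper `retention` and its parameters are hypothetical. It measures how much of a unit vector survives one such projection, for the spike direction and for one ordinary (bulk) direction.

```python
import numpy as np

def retention(n=100, p=200, s=25.0, n_trials=20, seed=0):
    """Average of d^T P d over trials, for the spike direction and a bulk direction."""
    rng = np.random.default_rng(seed)
    v = np.zeros(p)
    v[0] = 1.0   # spike direction of Sigma = I + s v v^T
    u = np.zeros(p)
    u[1] = 1.0   # an arbitrary direction orthogonal to the spike

    keep_spike, keep_bulk = [], []
    for _ in range(n_trials):
        Z = rng.standard_normal((n, p))
        X = Z + (np.sqrt(1.0 + s) - 1.0) * np.outer(Z @ v, v)  # rows ~ N(0, Sigma)
        P = np.linalg.pinv(X) @ X   # projection onto the row space of X
        keep_spike.append(v @ P @ v)
        keep_bulk.append(u @ P @ u)
    return float(np.mean(keep_spike)), float(np.mean(keep_bulk))

if __name__ == "__main__":
    spike, bulk = retention()
    print(f"retained along spike: {spike:.3f}, along bulk: {bulk:.3f}")
```

With a strong spike, the spike direction is retained almost entirely in each transfer, while a bulk direction keeps roughly a fraction $n/p$ of its energy. Repeating the transfer compounds this gap, which is one way to read the soft feature selection described in the abstract.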

Theorems & Definitions (33)

  • Theorem 3.2: Asymptotic risk recursion for the spiked covariance model
  • Corollary 3.3: Stochastic error reduction
  • Corollary 3.4: Asymptotics for strong spikes
  • Theorem 3.6: Risk decomposition for the multi-spike model
  • Definition 4.1: Effective parameters
  • Theorem 4.2: Concentration around deterministic equivalents
  • Theorem 4.3: Uniform consistency of iterated GCV
  • Corollary 2.1: Special cases: no signal and isotropic features
  • proof
  • proof
  • ...and 23 more
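
For readers unfamiliar with generalized cross-validation, the sketch below computes the classical GCV score for ridge regression, $\mathrm{GCV}(\lambda) = \frac{1}{n}\lVert (I - S_\lambda) y\rVert^2 / \bigl(1 - \operatorname{tr}(S_\lambda)/n\bigr)^2$, where $S_\lambda$ is the ridge smoother matrix. This is the standard single-fit criterion, shown only as background; it is not the iterated GCV of Theorem 4.3, whose definition is not given in this summary. The function name `ridge_gcv`, the $\lambda$ grid, and the data-generating choices are assumptions made for illustration.

```python
import numpy as np

def ridge_gcv(X, y, lambdas):
    """Classical GCV scores for ridge with penalty n * lam * ||beta||^2."""
    n = X.shape[0]
    # Work in the eigenbasis of K = X X^T / n, so each lambda costs O(n).
    evals, evecs = np.linalg.eigh(X @ X.T / n)
    evals = np.clip(evals, 0.0, None)   # guard against tiny negative round-off
    y_rot = evecs.T @ y
    scores = []
    for lam in lambdas:
        # S = K (K + lam I)^{-1}, hence I - S = lam (K + lam I)^{-1}.
        shrink = lam / (evals + lam)
        resid_sq = np.sum((shrink * y_rot) ** 2) / n   # ||(I - S) y||^2 / n
        denom = (np.sum(shrink) / n) ** 2              # (1 - tr(S) / n)^2
        scores.append(resid_sq / denom)
    return np.array(scores)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, p = 100, 200                      # overparameterized: p > n
    beta = rng.standard_normal(p) / np.sqrt(p)
    X = rng.standard_normal((n, p))
    y = X @ beta + rng.standard_normal(n)
    lambdas = np.logspace(-2, 1, 20)
    scores = ridge_gcv(X, y, lambdas)
    print("lambda minimizing GCV:", lambdas[int(np.argmin(scores))])
```

The score needs only the training data, which is what makes GCV-style criteria attractive for choosing tuning parameters without a held-out set. A strictly positive $\lambda$ is required here, since the ratio is degenerate for an interpolating fit.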