Why Self-Training Helps and Hurts: Denoising vs. Signal Forgetting
Mingqi Wu, Archer Y. Yang, Qiang Sun
TL;DR
This work analyzes iterative self-training in overparameterized linear regression. It identifies a fundamental tension between two effects of repeated teacher-student transfers: denoising, which shrinks the stochastic error inherited from the initial label noise, and signal forgetting, which progressively loses signal directions. Together these effects yield a $U$-shaped generalization curve and motivate early stopping. The analysis develops deterministic equivalents for the risk via fixed-point equations and spectral-shrinkage operators, and proves that the empirical risk concentrates around these limits, which enables fully data-driven stopping with iterated generalized cross-validation (GCV). The results further show that iteration acts as a spectral filter that preserves strong eigendirections while suppressing weaker ones, producing a form of soft feature selection distinct from ridge regression. Experiments on synthetic covariance settings and a ResNet-50 model on CIFAR-10 support the theory and illustrate the practical impact of the denoising-forgetting trade-off in self-training.
Abstract
Iterative self-training (self-distillation) repeatedly refits a model on pseudo-labels generated by its own predictions. We study this procedure in overparameterized linear regression: an initial estimator is trained on noisy labels, and each subsequent iterate is trained on fresh covariates with noiseless pseudo-labels from the previous model. In the high-dimensional regime, we derive deterministic-equivalent recursions for the prediction risk and effective noise across iterations, and prove that the empirical quantities concentrate sharply around these limits. The recursion separates two competing forces: a systematic component that grows with iteration due to progressive signal forgetting, and a stochastic component that decays due to denoising via repeated data-dependent projections. Their interaction yields a $U$-shaped test-risk curve and an optimal early-stopping time. In spiked covariance models, iteration further acts as an iteration-dependent spectral filter that preserves strong eigendirections while suppressing weaker ones, inducing an implicit form of soft feature selection distinct from ridge regression. Finally, we propose an iterated generalized cross-validation criterion and prove its uniform consistency for estimating the risk along the self-training trajectory, enabling fully data-driven selection of the stopping time and regularization. Experiments on synthetic covariances validate the theory and illustrate the predicted denoising-forgetting trade-off.
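The procedure described in the abstract can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes Gaussian covariates with a diagonal covariance given by `eigs`, ridge regression as the base learner at every step, and a single fixed regularization level `lam`. All function and parameter names (`ridge_fit`, `sample_covariates`, `excess_risk`, `self_training_path`) are hypothetical and chosen for illustration.

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Ridge estimator in dual form, convenient when d > n (assumed base learner)."""
    n = X.shape[0]
    gram = X @ X.T
    alpha = np.linalg.solve(gram + n * lam * np.eye(n), y)
    return X.T @ alpha


def sample_covariates(n, eigs, rng):
    """Draw n Gaussian covariates with diagonal covariance diag(eigs)."""
    return rng.standard_normal((n, eigs.shape[0])) * np.sqrt(eigs)


def excess_risk(beta, beta_star, eigs):
    """Excess prediction risk (beta - beta*)^T Sigma (beta - beta*) for diagonal Sigma."""
    diff = beta - beta_star
    return float(np.sum(eigs * diff ** 2))


def self_training_path(beta_star, eigs, n, sigma, lam, steps, seed=0):
    """Return the iterates and their excess risks along the self-training path."""
    rng = np.random.default_rng(seed)

    # Step 0: the initial estimator is trained on noisy labels.
    X = sample_covariates(n, eigs, rng)
    y = X @ beta_star + sigma * rng.standard_normal(n)
    beta = ridge_fit(X, y, lam)
    betas = [beta]

    # Steps 1..T: fresh covariates, noiseless pseudo-labels from the previous model.
    for _ in range(steps):
        X = sample_covariates(n, eigs, rng)
        beta = ridge_fit(X, X @ beta, lam)
        betas.append(beta)

    risks = [excess_risk(b, beta_star, eigs) for b in betas]
    return betas, risks


if __name__ == "__main__":
    d = 200
    eigs = np.ones(d)
    eigs[:5] = 10.0          # a few strong (spiked) eigendirections
    beta_star = np.zeros(d)
    beta_star[:5] = 1.0      # signal aligned with the spikes (an assumption)
    _, risks = self_training_path(beta_star, eigs, n=100, sigma=2.0, lam=1e-2, steps=8)
    for t, r in enumerate(risks):
        print(f"iteration {t}: excess risk = {r:.4f}")
```

Because each refit applies the operator $X^\top (XX^\top + n\lambda I)^{-1} X$, whose eigenvalues lie in $[0, 1)$, the norm of the iterate cannot grow from one step to the next. This contraction is what drives both denoising and forgetting in the sketch. Whether a given run actually shows a $U$-shaped risk curve depends on the assumed spectrum, noise level, sample size, and `lam`.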
