The Effect of Optimal Self-Distillation in Noisy Gaussian Mixture Model
Kaito Takanami, Takashi Takahashi, Ayaka Sakata
TL;DR
This paper analyzes optimal multi-stage self-distillation (SD) with a linear classifier on noisy Gaussian mixture data, using the replica method to obtain precise asymptotic generalization formulas in the proportional limit $N,M\to\infty$, $M/N\to\alpha$. It shows that denoising via hard pseudo-labels is the primary driver of SD improvements, with soft-label dark knowledge offering limited gains except in certain regimes. The work identifies two practical heuristics that robustly enhance SD (early stopping to limit the number of stages, and bias fixing under label imbalance) and validates key theoretical predictions with CIFAR-10 experiments using a pretrained ResNet backbone. Together, these results deepen understanding of SD mechanisms in noisy settings and guide effective multi-stage distillation strategies, while outlining directions for extending the analysis to non-linear and deep architectures.
Abstract
Self-distillation (SD), a technique where a model improves itself using its own predictions, has attracted attention as a simple yet powerful approach in machine learning. Despite its widespread use, the mechanisms underlying its effectiveness remain unclear. In this study, we investigate the efficacy of hyperparameter-tuned multi-stage SD with a linear classifier for binary classification on noisy Gaussian mixture data. For the analysis, we employ the replica method from statistical physics. Our findings reveal that the primary driver of SD's performance improvement is denoising through hard pseudo-labels, with the most notable gains observed in moderately sized datasets. We also identify two practical heuristics to enhance SD: early stopping that limits the number of stages, which is broadly effective, and bias parameter fixing, which helps under label imbalance. To empirically validate the theoretical findings derived from our toy model, we conduct additional experiments on CIFAR-10 classification using a pretrained ResNet backbone. These results provide both theoretical and practical insights, advancing our understanding and application of SD in noisy settings.
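To make the setting concrete, the following is a minimal sketch of multi-stage SD with hard pseudo-labels on noisy two-cluster Gaussian data. It is an illustration under stated assumptions, not the paper's procedure: the ridge least-squares classifier, the fixed regularization `lam`, the absence of a bias term, the data scaling, and all function names (`make_noisy_gmm`, `fit_ridge`, `self_distill`, `accuracy`) are hypothetical choices made here, and no per-stage hyperparameter tuning is performed.

```python
import numpy as np


def make_noisy_gmm(m, n, flip_prob, rng, signal=2.0):
    """Two-cluster Gaussian mixture; each label is flipped with prob. flip_prob."""
    mean = np.full(n, signal / np.sqrt(n))       # cluster centre with norm = signal
    y_true = rng.choice([-1.0, 1.0], size=m)     # equiprobable true labels
    x = y_true[:, None] * mean + rng.standard_normal((m, n))
    flips = rng.random(m) < flip_prob
    y_noisy = np.where(flips, -y_true, y_true)
    return x, y_true, y_noisy


def fit_ridge(x, y, lam):
    """Closed-form ridge regression used as a linear classifier (assumed loss, no bias)."""
    n = x.shape[1]
    return np.linalg.solve(x.T @ x + lam * np.eye(n), x.T @ y)


def self_distill(x, y_noisy, stages, lam=1.0):
    """Return weights for stage 0 (trained on noisy labels) and each SD stage."""
    w = fit_ridge(x, y_noisy, lam)
    weights = [w]
    for _ in range(stages):
        pseudo = np.sign(x @ w)      # hard pseudo-labels from the previous stage
        pseudo[pseudo == 0] = 1.0    # break exact ties arbitrarily
        w = fit_ridge(x, pseudo, lam)
        weights.append(w)
    return weights


def accuracy(w, x, y):
    """Fraction of points whose predicted sign matches the label."""
    return float(np.mean(np.sign(x @ w) == y))


rng = np.random.default_rng(0)
x_train, _, y_train_noisy = make_noisy_gmm(500, 50, 0.2, rng)   # 20% label noise
x_test, y_test, _ = make_noisy_gmm(5000, 50, 0.0, rng)          # clean test labels
weights = self_distill(x_train, y_train_noisy, stages=3)
accs = [accuracy(w, x_test, y_test) for w in weights]
for t, acc in enumerate(accs):
    print(f"stage {t}: clean test accuracy = {acc:.3f}")
```

Comparing the printed accuracies across stages gives a rough feel for how refitting on hard pseudo-labels interacts with label noise. Whether later stages help in this sketch depends on the assumed sample size, noise rate, and regularization, so it should not be read as reproducing the paper's asymptotic results.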
