The Effect of Optimal Self-Distillation in Noisy Gaussian Mixture Model

Kaito Takanami, Takashi Takahashi, Ayaka Sakata

TL;DR

This paper analyzes optimal multi-stage self-distillation (SD) with a linear classifier on noisy Gaussian mixture data, using the replica method to obtain precise asymptotic generalization formulas in the proportional limit $N,M\to\infty$, $M/N\to\alpha$. It shows that denoising via hard pseudo-labels is the primary driver of SD improvements, with soft-label dark knowledge offering limited gains except in certain regimes. The work identifies practical heuristics—early stopping to limit stages and bias-fixing under label imbalance—that robustly enhance SD, and validates key theoretical predictions with CIFAR-10 experiments using pretrained ResNet backbones. Together, these results deepen understanding of SD mechanisms in noisy settings and guide effective multi-stage distillation strategies, while outlining directions for extending the analysis to non-linear and deep architectures.

Abstract

Self-distillation (SD), a technique where a model improves itself using its own predictions, has attracted attention as a simple yet powerful approach in machine learning. Despite its widespread use, the mechanisms underlying its effectiveness remain unclear. In this study, we investigate the efficacy of hyperparameter-tuned multi-stage SD with a linear classifier for binary classification on noisy Gaussian mixture data. For the analysis, we employ the replica method from statistical physics. Our findings reveal that the primary driver of SD's performance improvement is denoising through hard pseudo-labels, with the most notable gains observed in moderately sized datasets. We also identify two practical heuristics to enhance SD: early stopping that limits the number of stages, which is broadly effective, and bias parameter fixing, which helps under label imbalance. To empirically validate the theoretical findings derived from our toy model, we conduct additional experiments on CIFAR-10 classification using pretrained ResNet backbones. These results provide both theoretical and practical insights, advancing our understanding and application of SD in noisy settings.
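The multi-stage procedure described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the paper's exact setup: labels are taken in $\{-1,+1\}$, the cluster centre is an all-ones direction, the classes are balanced, there is no bias term, and the fitted model is the plain label-weighted mean of the inputs (the direction that ridge regression approaches under very strong regularization) rather than the hyperparameter-tuned classifier analyzed in the paper. The names `make_noisy_gmm`, `fit_mean_classifier`, `predict`, and `self_distill` are hypothetical.

```python
import math
import random


def make_noisy_gmm(m, n, delta, theta, rng):
    """Sample m points from a two-cluster Gaussian mixture in n dimensions.

    Assumed setup: x = y * v / sqrt(n) + sqrt(delta) * z with z standard normal,
    and each observed label is flipped with probability theta.
    """
    v = [1.0] * n  # assumed cluster-centre direction
    xs, noisy, clean = [], [], []
    for _ in range(m):
        y = rng.choice((-1, 1))
        x = [y * v_k / math.sqrt(n) + math.sqrt(delta) * rng.gauss(0.0, 1.0)
             for v_k in v]
        xs.append(x)
        clean.append(y)
        noisy.append(-y if rng.random() < theta else y)
    return xs, noisy, clean


def fit_mean_classifier(xs, labels):
    """Return w = (1/M) * sum_i label_i * x_i (no bias term)."""
    n = len(xs[0])
    w = [0.0] * n
    for x, y in zip(xs, labels):
        for k in range(n):
            w[k] += y * x[k] / len(xs)
    return w


def predict(w, x):
    """Hard label in {-1, +1} from the sign of the linear score."""
    return 1 if sum(w_k * x_k for w_k, x_k in zip(w, x)) >= 0 else -1


def self_distill(xs, labels, stages):
    """Stage 0 fits the noisy labels; each later stage refits the same inputs
    on the previous stage's hard pseudo-labels."""
    w = fit_mean_classifier(xs, labels)
    history = [w]
    for _ in range(stages):
        pseudo = [predict(w, x) for x in xs]
        w = fit_mean_classifier(xs, pseudo)
        history.append(w)
    return history
```

Calling `self_distill(xs, noisy, stages=2)` returns the weight vectors of stages 0, 1, and 2, so the effect of each relabelling step can be checked against the clean labels. The sketch uses hard pseudo-labels only; soft labels and per-stage hyperparameter tuning, both of which the paper studies, are left out.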

Paper Structure

This paper contains 53 sections, 3 theorems, 102 equations, 10 figures, 1 table, 1 algorithm.

Key Result

Proposition 4.1

Under the proportional asymptotic limit ($N, M\to\infty$ with $M/N \to \alpha \in(0,\infty)$), the average generalization error of the $t$-SD model is given by a closed-form expression in terms of the Gaussian tail function $H(x) = 1- \int_{-\infty}^x \mathrm{d}z \, e^{-z^2/2} /\sqrt{2\pi}$.
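The function $H$ is the upper-tail probability of a standard normal variable, so it can be evaluated through the complementary error function using the identity $H(x) = \tfrac{1}{2}\,\mathrm{erfc}(x/\sqrt{2})$. The sketch below shows only this evaluation; the helper name `gaussian_tail` is hypothetical, and the arguments at which the proposition evaluates $H$ are not reproduced here.

```python
import math


def gaussian_tail(x):
    """H(x) = 1 - Phi(x), the standard normal upper-tail probability."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))
```

Using `math.erfc` rather than `1 - Phi(x)` avoids loss of precision for large positive `x`, where the tail probability is very small.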

Figures (10)

  • Figure 1: Heat map of the generalization error improvement $\mathcal{E}^{*0} - \mathcal{E}^{*1}$ at $\rho=0.4$ and $\theta=0$ in the linear $1$-SD model.
  • Figure 2: (A) and (B): generalization error improvements in the optimal logistic 1-SD model using soft labels $(\mathcal{E}^{*0} - \mathcal{E}^{*1})$ and hard labels $(\mathcal{E}^{*0} - \mathcal{E}^{*1}_\text{Hard})$, respectively. (C): the ratio of the two: $(\mathcal{E}^{*0} - \mathcal{E}^{*1}_\text{Hard}) / (\mathcal{E}^{*0} - \mathcal{E}^{*1})$. Parameters: $\rho = 0.4$, $\Delta = 1.0$.
  • Figure 3: (A) A comparison of the optimal generalization error for the linear $t$-SD model, $0$-SD model, and the noiseless case. (B) Dynamics of $\sqrt{\Delta Q^{tt}}$ and $m^t$ for the linear $t$-SD model with $\lambda^0, \cdots, \lambda^t \to\infty$. (C) Comparison of generalization error between the linear $t$-SD model with $\lambda^0, \cdots, \lambda^t \to\infty$ and optimal $t$-SD. Parameters for (A): $\rho = 0.5, \Delta = 0.5, \theta = 0.4$. (B, C): $\alpha = 1.0, \Delta = 1.2, \theta = 0.3, \beta^t = 1/\sqrt{Q^{t-1,t-1}}$.
  • Figure 4: (A) Optimal generalization error of the linear $t$-SD model compared with the $0$-SD model and the noiseless case under label imbalance ($\rho = 0.4$). (B) Evolution of the rescaled bias ($|b^t|/\sqrt{Q^{tt}}$) and alignment ($m^t/\sqrt{Q^{tt}}$) from $t = 0$ to $t = 4$ for the optimal $t$-SD model (solid lines) and the variant with fixed bias (dotted lines). (C) Generalization error over stages $t$ for the same models as in (B). Parameters: (A) $\rho = 0.4$, $\Delta = 0.5$, $\theta = 0.4$; (B, C) $\Delta = 1.0$, $\theta = 0.4$, $\alpha = 10.0$.
  • Figure 5: Comparison of the optimal generalization error of the logistic $0$-SD model, the $1$-SD model, and the $1$-SD model using hard pseudo-labels for CIFAR-10 dog versus cat classification using pretrained ResNet-18 ($N=512$) and ResNet-50 ($N=2048$) feature representations. Parameters: $\theta=0.4$. Error bars represent the standard error of the mean over 10 trials per point.
  • ...and 5 more figures

Theorems & Definitions (3)

  • Proposition 4.1
  • Theorem C.1
  • Proposition F.1