Optimal Unconstrained Self-Distillation in Ridge Regression: Strict Improvements, Precise Asymptotics, and One-Shot Tuning

Hien Dang, Pratik Patil, Alessandro Rinaldo

TL;DR

It is proved that, for any squared prediction risk, the optimally mixed student strictly improves upon the ridge teacher at every regularization level where the teacher's risk is nonstationary, and a consistent one-shot tuning method is proposed to estimate $\xi^\star$ without grid search, sample splitting, or refitting.

Abstract

Self-distillation (SD) is the process of retraining a student on a mixture of ground-truth labels and the teacher's own predictions using the same architecture and training data. Although SD has been empirically shown to often improve generalization, its formal guarantees remain limited. We study SD for ridge regression in the unconstrained setting, in which the mixing weight $\xi$ may lie outside the unit interval. Conditioned on the training data and without any distributional assumptions, we prove that for any squared prediction risk (including out-of-distribution), the optimally mixed student strictly improves upon the ridge teacher for every regularization level $\lambda > 0$ at which the teacher ridge risk $R(\lambda)$ is nonstationary (i.e., $R'(\lambda) \neq 0$). We obtain a closed-form expression for the optimal mixing weight $\xi^\star(\lambda)$ for any value of $\lambda$ and show that it obeys the sign rule: $\operatorname{sign}(\xi^\star(\lambda))=-\operatorname{sign}(R'(\lambda))$. In particular, $\xi^\star(\lambda)$ can be negative, which is the case in over-regularized regimes. To quantify the risk improvement due to SD, we derive exact deterministic equivalents for the optimal SD risk in the proportional asymptotics regime (where the sample and feature sizes $n$ and $p$ both diverge but their aspect ratio $p/n$ converges) under general anisotropic covariance and deterministic signals. Our asymptotic analysis extends standard second-order ridge deterministic equivalents to their fourth-order analogs using block linearization, which may be of independent interest. From a practical standpoint, we propose a consistent one-shot tuning method to estimate $\xi^\star$ without grid search, sample splitting, or refitting. Experiments on real-world datasets and pretrained neural network features support our theory and the one-shot tuning method.

Paper Structure (82 sections, 39 theorems, 286 equations, 25 figures, 4 tables)

Key Result

Proposition 2.1

Fix $\lambda > 0$ and assume $D(\lambda) > 0$. Then […]. In particular, $R_{\mathsf{sd}}^\star(\lambda) \le R(\lambda)$ for all $\lambda$, and $\xi^\star(\lambda)$ may be negative.
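
A decomposition of this kind can be motivated by a short calculation, sketched here under assumptions that are not confirmed by this page: the student is trained on $(1-\xi)\,y + \xi\,X\hat\beta_\lambda$, expectations are over a test pair $(x_0, y_0)$ conditional on the training data, and $C(\lambda)$, $D(\lambda)$, and $\Delta_\lambda$ are defined as below (the paper's own definition of $D(\lambda)$ may differ in normalization).

```latex
\begin{align*}
\hat\beta_{\mathsf{sd}}(\xi)
  &= (1-\xi)\,\hat\beta_\lambda + \xi\,\hat\beta_{\mathsf{pd}}
   = \hat\beta_\lambda + \xi\,\Delta_\lambda,
  \qquad \Delta_\lambda := \hat\beta_{\mathsf{pd}} - \hat\beta_\lambda, \\
R_{\mathsf{sd}}(\lambda, \xi)
  &= R(\lambda) - 2\,\xi\,C(\lambda) + \xi^2 D(\lambda), \\
C(\lambda)
  &:= \mathbb{E}\big[(y_0 - x_0^\top \hat\beta_\lambda)\, x_0^\top \Delta_\lambda\big],
  \qquad
D(\lambda) := \mathbb{E}\big[(x_0^\top \Delta_\lambda)^2\big], \\
\xi^\star(\lambda)
  &= \frac{C(\lambda)}{D(\lambda)},
  \qquad
R_{\mathsf{sd}}^\star(\lambda)
   = R(\lambda) - \frac{C(\lambda)^2}{D(\lambda)} \;\le\; R(\lambda).
\end{align*}
```

For ridge, $\hat\beta_{\mathsf{pd}} - \hat\beta_\lambda = \lambda\,\partial_\lambda \hat\beta_\lambda$, so under these assumed definitions $C(\lambda) = -\tfrac{\lambda}{2} R'(\lambda)$ and hence $\xi^\star(\lambda) = -\lambda R'(\lambda) / (2 D(\lambda))$. This is consistent with the sign rule $\operatorname{sign}(\xi^\star(\lambda)) = -\operatorname{sign}(R'(\lambda))$ and with the improvement being strict exactly when $R'(\lambda) \neq 0$.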

Figures (25)

  • Figure 1: Visual illustration of the self-distillation process.
  • Figure 2: Strict improvement of SD risk with unconstrained mixing. Test squared prediction risk of ridge regression ($R$, in blue), pure-distilled ridge ($R_{\mathsf{pd}}$, in light blue), and optimal self-distilled ridge ($R^{\star}_{\mathsf{sd}}$, in green) as functions of the ridge penalty $\lambda$. Results are shown on raw features from two real-world datasets (BlogFeedback and Communities and Crime) and on pretrained ResNet-18 features. The optimal mixing parameter $\xi^{\star}(\lambda)$ is shown in red, and the one-shot risk estimate $\widehat{R}_{\mathsf{sd}}^\star(\lambda)$ computed from the training data is shown as a green dashed line. Note that $\xi^\star(\lambda)$ lies in $[0,1]$ only for a narrow range of $\lambda$ and can be strongly negative for large $\lambda$. We also observe that: (i) $R_{\mathsf{sd}}^{\star}(\lambda)$ is strictly smaller than $R(\lambda)$ at every $\lambda$ that is not a stationary point of $R(\lambda)$, (ii) the sign of $\xi^{\star}(\lambda)$ is opposite to the sign of $R^{\prime}(\lambda)$, and (iii) the sign change of $\xi^{\star}$ happens at the stationary point of $R(\lambda)$. (Experiments with $\xi$ restricted to $[0,1]$ appear in \ref{fig:exp_real_world_restricted}.)
  • Figure 3: Out-of-distribution SD risk improvement. Test prediction risk of ridge and optimal SD ridge on the Air Quality dataset (see \ref{sec:additional_details} for more details). SD yields strict improvements across $\lambda$ and achieves a substantially smaller global minimum.
  • Figure 4: Theoretical versus empirical risks. Asymptotic SD risks and optimal mixing weights versus $\lambda$ across multiple $\mathrm{SNR}$ values. Empirical curves are averaged over $30$ simulations. Estimated curves are obtained using the proposed one-shot tuning method (\ref{sec:tuning}), and the theoretical curves are obtained from \ref{thm:risk-asymptotics}. Setting: $n = 400$, $p = 200$, $\sigma^2 = 1$, $r^2 = \sigma^2\,\mathrm{SNR}$; $\Sigma$ is AR(1), and $\beta$ is a deterministic signal aligned with the top $10\%$ eigenvectors of $\Sigma$ (alignment factor 0.9). (See \ref{sec:additional_details} for more details.)
  • Figure 5: Asymptotic gain over the teacher. Deterministic limits of the risks and the gain $\mathcal{R}(\lambda)-\mathcal{R}_{\mathsf{sd}}^\star(\lambda)$. Same setting as \ref{fig:asymptotic_ar1_top_aligned} with $r^2 = \sigma^2 = 1$.
  • ...and 20 more figures

Theorems & Definitions (40)

  • Proposition 2.1: Optimal SD risk decomposition
  • Theorem 2.2: Strict improvement and sign rule
  • Proposition 2.3: Curvature test at the ridge-optimal $\lambda$
  • Theorem 3.1: Risk asymptotics
  • Corollary 3.2
  • Proposition 3.3: Comparison with the optimal ridge
  • Theorem 4.1: Consistency of one-shot SD tuning
  • Proposition 5.1: Monotonicity of optimal recursive multi-round self-distillation
  • Theorem 5.2: Same-$X$ dominates fresh-$X$
  • Theorem 5.3: Ridge-smoother strict improvement and sign rule
  • ...and 30 more
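
The sign rule named in Theorem 2.2 can be probed numerically with a short sketch. The setup is an illustration under stated assumptions, not the paper's experiment: synthetic Gaussian data, a $1/n$-normalized ridge objective, mixing weight $\xi$ placed on the teacher's predictions, risk measured on a large held-out sample, and hypothetical helper names (`teacher`, `risk`, `risk_derivative`, `xi_star`).

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, n_test = 150, 40, 5000
beta = rng.normal(size=p) / np.sqrt(p)
X, X_test = rng.normal(size=(n, p)), rng.normal(size=(n_test, p))
y = X @ beta + rng.normal(size=n)
y_test = X_test @ beta + rng.normal(size=n_test)
S, b = X.T @ X / n, X.T @ y / n  # sample covariance and cross-moment

def teacher(lam):
    """Ridge teacher (S + lam I)^{-1} b."""
    return np.linalg.solve(S + lam * np.eye(p), b)

def risk(lam):
    """Held-out squared prediction risk of the teacher."""
    return float(np.mean((y_test - X_test @ teacher(lam)) ** 2))

def risk_derivative(lam):
    """dR/dlam, using d(beta_lam)/dlam = -(S + lam I)^{-1} beta_lam."""
    t = teacher(lam)
    dt = -np.linalg.solve(S + lam * np.eye(p), t)
    return float(-2.0 * np.mean((y_test - X_test @ t) * (X_test @ dt)))

def xi_star(lam):
    """Mixing weight minimizing the held-out risk of the mixed student."""
    t = teacher(lam)
    pure = np.linalg.solve(S + lam * np.eye(p), S @ t)  # student fit to teacher predictions only
    resid, delta = y_test - X_test @ t, X_test @ (pure - t)
    return float(resid @ delta / (delta @ delta))

lams = np.logspace(-3, 1, 9)
rows = [(lam, risk_derivative(lam), xi_star(lam)) for lam in lams]
for lam, dR, xi in rows:
    print(f"lambda={lam:8.3f}  R'={dR:+.4f}  xi*={xi:+.4f}")
```

In this construction the numerator of `xi_star` equals $-\tfrac{\lambda}{2}\, n_{\text{test}}\, R'(\lambda)$ for the held-out risk, so the printed signs of $R'(\lambda)$ and $\xi^\star(\lambda)$ should be opposite at every $\lambda$ in the grid, with the sign change occurring where the teacher's risk curve turns.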