Optimal Unconstrained Self-Distillation in Ridge Regression: Strict Improvements, Precise Asymptotics, and One-Shot Tuning

Hien Dang; Pratik Patil; Alessandro Rinaldo

TL;DR

We prove that, for any squared prediction risk, the optimally mixed student strictly improves upon the ridge teacher at every regularization level where the teacher's risk is nonstationary, and we propose a consistent one-shot tuning method that estimates $\xi^\star$ without grid search, sample splitting, or refitting.

Abstract

Self-distillation (SD) is the process of retraining a student on a mixture of ground-truth labels and the teacher's own predictions using the same architecture and training data. Although SD has often been shown empirically to improve generalization, its formal guarantees remain limited. We study SD for ridge regression in an unconstrained setting in which the mixing weight $\xi$ may lie outside the unit interval. Conditioned on the training data and without any distributional assumptions, we prove that for any squared prediction risk (including out-of-distribution), the optimally mixed student strictly improves upon the ridge teacher for every regularization level $\lambda > 0$ at which the teacher ridge risk $R(\lambda)$ is nonstationary (i.e., $R'(\lambda) \neq 0$). We obtain a closed-form expression for the optimal mixing weight $\xi^\star(\lambda)$ for any value of $\lambda$ and show that it obeys the sign rule: $\operatorname{sign}(\xi^\star(\lambda)) = -\operatorname{sign}(R'(\lambda))$. In particular, $\xi^\star(\lambda)$ can be negative, which is the case in over-regularized regimes. To quantify the risk improvement due to SD, we derive exact deterministic equivalents for the optimal SD risk in the proportional asymptotics regime (where the sample and feature sizes $n$ and $p$ both diverge but their aspect ratio $p/n$ converges) under general anisotropic covariance and deterministic signals. Our asymptotic analysis extends standard second-order ridge deterministic equivalents to their fourth-order analogs using block linearization, which may be of independent interest. From a practical standpoint, we propose a consistent one-shot tuning method to estimate $\xi^\star$ without grid search, sample splitting, or refitting. Experiments on real-world datasets and pretrained neural network features support our theory and the one-shot tuning method.
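The setup described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' code. It assumes the student is refit with the same penalty $\lambda$ on the labels $(1-\xi)\,y + \xi\,X\hat\beta_\lambda$, it assumes the ridge normalization $(X^\top X/n + \lambda I)^{-1} X^\top y/n$, and it uses an oracle that knows the true signal `beta` and covariance `Sigma`. All function names and simulation settings are hypothetical.

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge estimator (X'X/n + lam*I)^{-1} X'y/n (assumed normalization)."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)

def sd_student(X, y, lam, xi):
    """Refit ridge on labels mixed with the teacher's in-sample predictions."""
    teacher = ridge(X, y, lam)
    y_mix = (1.0 - xi) * y + xi * (X @ teacher)
    return ridge(X, y_mix, lam)

def risk(b, beta, Sigma):
    """Excess squared prediction risk (b - beta)' Sigma (b - beta)."""
    d = b - beta
    return float(d @ Sigma @ d)

def optimal_xi(X, y, lam, beta, Sigma):
    """Oracle minimizer of the quadratic map xi -> risk(student(xi))."""
    b0 = sd_student(X, y, lam, 0.0)  # xi = 0 recovers the teacher
    b1 = sd_student(X, y, lam, 1.0)  # xi = 1 is the pure-distilled student
    delta = b1 - b0                  # the student is affine in xi
    D = float(delta @ Sigma @ delta)
    return -float(delta @ Sigma @ (b0 - beta)) / D

# Hypothetical simulation settings (isotropic design, unit noise).
rng = np.random.default_rng(0)
n, p, lam = 200, 50, 5.0
Sigma = np.eye(p)
beta = rng.normal(size=p) / np.sqrt(p)
X = rng.normal(size=(n, p))
y = X @ beta + rng.normal(size=n)

xi_star = optimal_xi(X, y, lam, beta, Sigma)
R_teacher = risk(ridge(X, y, lam), beta, Sigma)
R_sd = risk(sd_student(X, y, lam, xi_star), beta, Sigma)

# Central finite difference for the teacher's risk derivative R'(lam).
eps = 1e-5
dR = (risk(ridge(X, y, lam + eps), beta, Sigma)
      - risk(ridge(X, y, lam - eps), beta, Sigma)) / (2 * eps)

print("xi_star:", xi_star)
print("teacher risk:", R_teacher, "optimal SD risk:", R_sd)
print("R'(lam):", dR)
```

The oracle needs `beta` and `Sigma`, which are unavailable in practice. According to the abstract, the paper's one-shot tuning method estimates $\xi^\star$ from the training data alone, and this sketch does not attempt to reproduce that estimator.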

Paper Structure (82 sections, 39 theorems, 286 equations, 25 figures, 4 tables)


Introduction
Summary of Paper Contributions and Outline
Related Works and Comparisons
Structural Nonasymptotic Results
Self-Distillation with Ridge Regression
Optimal SD Risk Decomposition
Strict Pointwise Improvement and Sign of Optimal Mixing Weight
Can Self-Distillation Beat Optimally Tuned Ridge?
Proportional Asymptotic Results
Data Assumptions
Asymptotics of Optimal Self-Distillation Risk and Mixing Weight
Self-Distillation Risks with Extreme Regularization
One-Shot Tuning and Risk Estimation
Risk Estimators via Generalized Cross-Validation
One-Shot Estimators for Optimal Mixing Weight and Optimal SD Risk
...and 67 more sections

Key Result

Proposition 2.1

Fix $\lambda > 0$ and assume $D(\lambda) > 0$. Then, in particular, $R_{\mathsf{sd}}^\star(\lambda) \le R(\lambda)$ for all $\lambda$, and $\xi^\star(\lambda)$ may be negative.
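One way to see why an inequality of this form can hold is a short quadratic-minimization argument. The derivation below is a sketch under stated assumptions, not a restatement of the paper's proposition. It assumes the student is refit with the same penalty $\lambda$ on the labels $(1-\xi)\,y + \xi\,X\hat\beta_\lambda$, that the risk is $\|\hat\beta - \beta\|_\Sigma^2$ up to an additive constant, and that $D(\lambda)$ can be read as $\|\Delta\|_\Sigma^2$. The symbols $\hat\beta_\lambda$, $\hat\beta_{\mathsf{pd}}$, $\Delta$, and $\Sigma$ are notation introduced here and may differ from the paper's.

```latex
% Ridge is linear in the labels, so the student is affine in \xi:
\hat\beta_{\mathsf{sd}}(\xi) = (1-\xi)\,\hat\beta_\lambda + \xi\,\hat\beta_{\mathsf{pd}}
  = \hat\beta_\lambda + \xi\,\Delta,
\qquad
\Delta = \hat\beta_{\mathsf{pd}} - \hat\beta_\lambda
  = \lambda\,\frac{\mathrm{d}\hat\beta_\lambda}{\mathrm{d}\lambda}.

% Since R'(\lambda) = 2\,\langle \mathrm{d}\hat\beta_\lambda/\mathrm{d}\lambda,\ \hat\beta_\lambda - \beta\rangle_\Sigma,
% the SD risk is a quadratic in \xi:
R_{\mathsf{sd}}(\lambda, \xi) = R(\lambda) + \xi\,\lambda\,R'(\lambda) + \xi^2 D(\lambda),
\qquad
D(\lambda) = \|\Delta\|_\Sigma^2.

% Minimizing over \xi \in \mathbb{R} when D(\lambda) > 0:
\xi^\star(\lambda) = -\frac{\lambda\,R'(\lambda)}{2\,D(\lambda)},
\qquad
R_{\mathsf{sd}}^\star(\lambda) = R(\lambda) - \frac{\bigl(\lambda\,R'(\lambda)\bigr)^2}{4\,D(\lambda)}
  \le R(\lambda).
```

Under these assumptions the inequality is strict whenever $R'(\lambda) \neq 0$, and $\xi^\star(\lambda)$ has the opposite sign to $R'(\lambda)$, which is consistent with the sign rule stated in the abstract.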

Figures (25)

Figure 1: Visual illustration of the self-distillation process.
Figure 2: Strict improvement of SD risk with unconstrained mixing. Test squared prediction risk of ridge regression ($R$, in blue), pure-distilled ridge ($R_{\mathsf{pd}}$, in light blue), and optimal self-distilled ridge ($R^{\star}_{\mathsf{sd}}$, in green) as functions of the ridge penalty $\lambda$. Results are shown on raw features from two real-world datasets (BlogFeedback, and Communities and Crime) and on pretrained ResNet-18 features. The optimal mixing parameter $\xi^{\star}(\lambda)$ is shown in red, and the one-shot risk estimate $\widehat{R}_{\mathsf{sd}}^\star(\lambda)$ computed from the training data is shown as a green dashed line. Note that $\xi^\star(\lambda)$ lies in $[0,1]$ only for a narrow range of $\lambda$ and can be strongly negative for large $\lambda$. We also observe that: (i) $R_{\mathsf{sd}}^{\star}(\lambda)$ is strictly smaller than $R(\lambda)$ at every $\lambda$ that is not a stationary point of $R(\lambda)$, (ii) the sign of $\xi^{\star}(\lambda)$ is opposite to the sign of $R^{\prime}(\lambda)$, and (iii) the sign change of $\xi^{\star}$ happens at the stationary point of $R(\lambda)$. (Experiments with $\xi$ restricted to $[0,1]$ appear in \ref{fig:exp_real_world_restricted}.)
Figure 3: Out-of-distribution SD risk improvement. Test prediction risk of ridge and optimal SD ridge on the Air Quality dataset (see \ref{sec:additional_details} for more details). SD yields strict improvements across $\lambda$ and achieves a substantially smaller global minimum.
Figure 4: Theoretical versus empirical risks. Asymptotic SD risks and optimal mixing weights versus $\lambda$ across multiple $\textsc{snr}$ values. Empirical curves are averaged over $30$ simulations. Estimated curves are obtained using the proposed one-shot tuning method (\ref{sec:tuning}), and the theoretical curves are obtained from \ref{thm:risk-asymptotics}. Setting: $n = 400$, $p = 200$, $\sigma^2 = 1$, $r^2 = \sigma^2\,\textsc{snr}$; $\Sigma$ is AR(1), $\beta$ is a deterministic signal aligned with the top $10\%$ eigenvectors of $\Sigma$ (alignment factor 0.9). (See \ref{sec:additional_details} for more details.)
Figure 5: Asymptotic gain over the teacher. Deterministic limits of the risks and the gain $\mathcal{R}(\lambda)-\mathcal{R}_{\mathsf{sd}}^\star(\lambda)$. Same setting as \ref{fig:asymptotic_ar1_top_aligned} with $r^2 = \sigma^2 = 1$.
...and 20 more figures

Theorems & Definitions (40)

Proposition 2.1: Optimal SD risk decomposition
Theorem 2.2: Strict improvement and sign rule
Proposition 2.3: Curvature test at the ridge-optimal $\lambda$
Theorem 3.1: Risk asymptotics
Corollary 3.2
Proposition 3.3: Comparison with the optimal ridge
Theorem 4.1: Consistency of one-shot SD tuning
Proposition 5.1: Monotonicity of optimal recursive multi-round self-distillation
Theorem 5.2: Same-$X$ dominates fresh-$X$
Theorem 5.3: Ridge-smoother strict improvement and sign rule
...and 30 more
