High-dimensional Analysis of Knowledge Distillation: Weak-to-Strong Generalization and Scaling Laws

M. Emrullah Ildiz, Halil Alperen Gozeten, Ege Onur Taga, Marco Mondelli, Samet Oymak

TL;DR

This work develops a sharp, non-asymptotic analysis of knowledge distillation in high-dimensional ridgeless regression under both model shift and distribution shift. It characterizes the optimal surrogate ${\boldsymbol{\beta}}^{s*}$ and reveals a transition from amplification to shrinkage along the eigen-spectrum, clarifying when discarding weak features improves downstream risk. The paper links weak-to-strong generalization to mask-based surrogate selection and proves a scaling-law result: the surrogate can strictly improve risk, but it does not change the data scaling law. These findings are corroborated by numerical experiments on ridgeless regression and by neural network experiments on CIFAR-10. The paper also analyzes a two-stage ERM framework, deriving non-asymptotic risk expressions for the two-stage model and showing that the scaling law again matches that of the standard target model. Overall, the results clarify when weak supervision helps in high-dimensional settings and give precise prescriptions for surrogate design and feature selection.

Abstract

A growing number of machine learning scenarios rely on knowledge distillation where one uses the output of a surrogate model as labels to supervise the training of a target model. In this work, we provide a sharp characterization of this process for ridgeless, high-dimensional regression, under two settings: (i) model shift, where the surrogate model is arbitrary, and (ii) distribution shift, where the surrogate model is the solution of empirical risk minimization with out-of-distribution data. In both cases, we characterize the precise risk of the target model through non-asymptotic bounds in terms of sample size and data distribution under mild conditions. As a consequence, we identify the form of the optimal surrogate model, which reveals the benefits and limitations of discarding weak features in a data-dependent fashion. In the context of weak-to-strong (W2S) generalization, this has the interpretation that (i) W2S training, with the surrogate as the weak model, can provably outperform training with strong labels under the same data budget, but (ii) it is unable to improve the data scaling law. We validate our results on numerical experiments both on ridgeless regression and on neural network architectures.
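The surrogate-to-target pipeline described in the abstract can be sketched numerically. The following is a minimal illustration, not the authors' code: it assumes Gaussian features with the power-law covariance quoted in the Figure 1 caption, assumes that surrogate labels carry the same additive noise as the true labels, and uses a simple 0-1 mask that keeps the first `keep` features as the surrogate. The function names, the noise level `sigma`, and the value of `keep` are hypothetical choices made for illustration.

```python
import numpy as np

def min_norm_fit(X, y):
    """Ridgeless (minimum-norm) least-squares solution via the pseudoinverse."""
    return np.linalg.pinv(X) @ y

def excess_risk(beta_hat, beta_star, lam):
    """Excess test risk (beta_hat - beta_star)^T Sigma (beta_hat - beta_star)
    for a diagonal covariance Sigma = diag(lam)."""
    diff = beta_hat - beta_star
    return float(np.sum(lam * diff ** 2))

def simulate(p=500, n=200, sigma=0.1, keep=50, seed=0):
    rng = np.random.default_rng(seed)
    idx = np.arange(1, p + 1, dtype=float)
    lam = idx ** -2.0                       # lambda_i = i^{-2}
    beta_star = np.sqrt(idx ** -1.5 / lam)  # so that lambda_i * beta_i^2 = i^{-1.5}
    X = rng.standard_normal((n, p)) * np.sqrt(lam)  # rows ~ N(0, diag(lam))
    noise = sigma * rng.standard_normal(n)

    # Standard target model: trained on labels from the ground truth.
    y_true = X @ beta_star + noise
    beta_target = min_norm_fit(X, y_true)

    # Surrogate-to-target model: trained on labels from a masked surrogate.
    # Assumption: the mask keeps the first `keep` coordinates; this is a
    # hypothetical choice, not the optimal mask derived in the paper.
    beta_s = np.where(idx <= keep, beta_star, 0.0)
    y_surr = X @ beta_s + noise
    beta_s2t = min_norm_fit(X, y_surr)

    return {
        "target": excess_risk(beta_target, beta_star, lam),
        "surrogate_to_target": excess_risk(beta_s2t, beta_star, lam),
    }

if __name__ == "__main__":
    print(simulate())
```

Both risks are measured against the ground-truth parameter, so the two numbers are directly comparable. Which one is smaller depends on the mask and on the sample size, which is the question the paper's analysis addresses.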

Paper Structure

This paper contains 14 sections, 34 theorems, 169 equations, 5 figures, 1 table.

Key Result

Theorem 1

Suppose that, for some constant $M_t>1$, we have $1/M_t \leq \kappa_t, {\sigma}_{t}^2 \leq M_t$ and $\left\|{\boldsymbol{\Sigma}_{t}}\right\|_{\text{op}}, \left\|{\boldsymbol{\Sigma}_{t}^{-1}}\right\|_{\text{op}} \leq M_t$. Recall from setup risk that $\mathcal{R}({\boldsymbol{\beta}}^{s2t})$ repr

Figures (5)

  • Figure 1: Structure and performance of optimal surrogate models. (a): We compare the weights of the optimal surrogate model (green) with the ground-truth (blue). This reveals a transition from amplification to shrinkage as we move from principal to tail eigenvalues. The yellow curve displays the optimal 0-1 masking of the ground-truth where we either keep or discard a feature. (b): Associated test risks as a function of sample size. The theoretical bounds (full lines) match the experiments (markers). Setting: The feature size is $p=500$; the sample size is $n=200$ in (a) and variable in (b); the feature covariance follows the power-law structure $\lambda_i = i^{-2}$, $\lambda_i \beta_i^2 = i^{-1.5}$; $\zeta_i$ is the covariance statistic (see the corollary on the optimal surrogate) governing the optimal surrogate's structure.
  • Figure 2: (a): On the CIFAR-10 dataset, we fine-tune a ResNet50 model using the ground-truth labels (target) and the predictions of three weak convolutional models (surrogate) with different capacities: big (b), medium (m), and small (s). We observe that surrogate-to-target models consistently achieve higher accuracy than the surrogate models, even though they are trained on the surrogate models' predictions. (b): We compare the experimental two-stage risk with our estimated theoretical risk. In the experimental setup, $p = 100$, and we vary $n = m$ from 1 to 100. Both feature covariances follow the power-law structure $\lambda_i = i^{-\alpha}$ for $\alpha = 0.5, 1, 1.5$ and $2$; the ground truth parameter ${\boldsymbol{\beta}_\star}$ is specified as $\beta_i = 1$.
  • Figure 3: Comparison of the empirical and theoretical number of features satisfying the feature selection condition in the optimal mask $\mathcal{M}^{*}$ ($\zeta_i^2 < 1 - \Omega$). The theoretical value is calculated as $n \dfrac{\alpha \sin{(\pi / \alpha)}}{\pi (\sqrt{\alpha} - 1)^{1 / \alpha}}$, ignoring the $O(1)$ term in the proposition on the closed-form pruning condition. Setting: The feature size is $p=500$, and the feature covariance follows the power-law structure $\lambda_i = i^{-\alpha}$ for $\alpha = 1.5, 3.0,$ and $4.5$.
  • Figure 4: Scaling law behavior of the test risks of optimal surrogate models. (a): Associated test risks as a function of sample size in log-log scale. Setting: The feature size is $p=500$; the sample size $n$ changes from $50$ to $450$ with increments of $50$; the feature covariance follows the power-law structure $\lambda_i = i^{-2}$, $\lambda_i \beta_i^2 = i^{-1.5}$. (b): Associated test risks as a function of sample size in log-log scale when $p \gg n$. Setting: Same as in (a) except that $p=5000$.
  • Figure 5: We compare the experimental two-stage risk with our estimated theoretical risk in log-log scale to demonstrate the scaling law. Setting: The feature size is $p=1000$; the sample sizes $m=n$ change from $50$ to $900$ with increments of $50$; both feature covariances follow the power-law structure $\lambda_i = i^{-\alpha}$ for $\alpha = 0.5, 1, 1.5$ and $2$; the ground truth parameter ${\boldsymbol{\beta}_\star}$ is specified as $\beta_i = 1$.
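
The theoretical feature count quoted in the Figure 3 caption can be evaluated directly. The sketch below simply transcribes that expression, dropping the $O(1)$ term as the caption does. The function name is hypothetical, and the restriction to $\alpha > 1$ is an assumption made here so that $\sqrt{\alpha} - 1$ is positive and the fractional power is well defined.

```python
import math

def predicted_feature_count(n: int, alpha: float) -> float:
    """Evaluate n * alpha * sin(pi / alpha) / (pi * (sqrt(alpha) - 1)^(1 / alpha)),
    the leading-order count from the Figure 3 caption (O(1) term ignored)."""
    if alpha <= 1:
        # Assumption of this sketch: the expression is only evaluated for alpha > 1.
        raise ValueError("alpha must be greater than 1")
    numerator = alpha * math.sin(math.pi / alpha)
    denominator = math.pi * (math.sqrt(alpha) - 1) ** (1 / alpha)
    return n * numerator / denominator

if __name__ == "__main__":
    # The exponents used in Figure 3; n = 200 is an arbitrary illustrative value.
    for alpha in (1.5, 3.0, 4.5):
        print(alpha, predicted_feature_count(200, alpha))
```

The expression is linear in $n$, so the predicted count grows proportionally with the sample size for a fixed exponent $\alpha$.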

Theorems & Definitions (68)

  • Definition 1
  • Theorem 1
  • Proposition 1
  • Corollary 1
  • Proposition 2
  • Proposition 3
  • Definition 2: Omniscient test risk estimate
  • Proposition 4: Asymptotic analysis of $\taus$ and $\Omega$
  • Proposition 5
  • Proposition 6: Scaling law
  • ...and 58 more