High-dimensional Analysis of Knowledge Distillation: Weak-to-Strong Generalization and Scaling Laws

M. Emrullah Ildiz; Halil Alperen Gozeten; Ege Onur Taga; Marco Mondelli; Samet Oymak

TL;DR

This work develops a sharp, non-asymptotic analysis of knowledge distillation in high-dimensional ridgeless regression under both model shift and distribution shift. It characterizes the optimal surrogate ${\boldsymbol{\beta}}^{s*}$ and reveals a transition from amplification to shrinkage that is driven by the eigen-spectrum, clarifying when discarding weak features improves downstream risk. The paper links weak-to-strong generalization to mask-based surrogate selection and proves that, asymptotically, the surrogate can strictly improve risk but does not change the data-efficiency scaling law relative to the standard target model. It also develops a two-stage ERM framework and derives non-asymptotic risk expressions for the two-stage model. Numerical experiments and CIFAR-10-style tests corroborate these findings. Overall, the results show when weak supervision helps in high-dimensional settings and give precise prescriptions for surrogate design and feature selection.

Abstract

A growing number of machine learning scenarios rely on knowledge distillation where one uses the output of a surrogate model as labels to supervise the training of a target model. In this work, we provide a sharp characterization of this process for ridgeless, high-dimensional regression, under two settings: (i) model shift, where the surrogate model is arbitrary, and (ii) distribution shift, where the surrogate model is the solution of empirical risk minimization with out-of-distribution data. In both cases, we characterize the precise risk of the target model through non-asymptotic bounds in terms of sample size and data distribution under mild conditions. As a consequence, we identify the form of the optimal surrogate model, which reveals the benefits and limitations of discarding weak features in a data-dependent fashion. In the context of weak-to-strong (W2S) generalization, this has the interpretation that (i) W2S training, with the surrogate as the weak model, can provably outperform training with strong labels under the same data budget, but (ii) it is unable to improve the data scaling law. We validate our results on numerical experiments both on ridgeless regression and on neural network architectures.
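The surrogate-to-target setup described above can be sketched numerically. The following is a minimal illustration under assumptions that are not taken from the paper: Gaussian features with a diagonal power-law covariance, a surrogate obtained by masking the ground-truth parameter (keeping only the `keep` strongest features), and the same additive noise for both label types. All function names and default values are hypothetical. It fits the minimum-norm interpolator once on strong labels and once on surrogate labels, then reports the excess risk of each against the ground truth.

```python
import numpy as np


def min_norm_fit(X, y):
    """Ridgeless (minimum-norm) least-squares solution via the pseudoinverse."""
    return np.linalg.pinv(X) @ y


def excess_risk(beta_hat, beta_star, cov_diag):
    """Excess risk (beta_hat - beta_star)^T Sigma (beta_hat - beta_star) for diagonal Sigma."""
    diff = beta_hat - beta_star
    return float(np.sum(cov_diag * diff ** 2))


def run_demo(n=50, p=200, noise_std=0.1, keep=20, seed=0):
    """Compare training on strong labels with training on surrogate labels.

    Assumptions (illustrative only): power-law spectrum 1/i, masked surrogate,
    identical noise draw for both label sets.
    """
    rng = np.random.default_rng(seed)
    cov_diag = 1.0 / np.arange(1, p + 1)             # assumed eigen-spectrum
    beta_star = rng.normal(size=p)                   # ground-truth parameter
    X = rng.normal(size=(n, p)) * np.sqrt(cov_diag)  # rows ~ N(0, diag(cov_diag))
    noise = noise_std * rng.normal(size=n)

    # Strong labels come from the ground truth.
    y_strong = X @ beta_star + noise

    # Surrogate labels come from a masked parameter that discards weak features.
    mask = np.zeros(p)
    mask[:keep] = 1.0
    beta_surrogate = mask * beta_star
    y_surrogate = X @ beta_surrogate + noise

    beta_target = min_norm_fit(X, y_strong)          # standard target model
    beta_s2t = min_norm_fit(X, y_surrogate)          # surrogate-to-target model

    return {
        "risk_target": excess_risk(beta_target, beta_star, cov_diag),
        "risk_s2t": excess_risk(beta_s2t, beta_star, cov_diag),
        # With p > n both fits should interpolate their training labels.
        "interp_residual": float(np.max(np.abs(X @ beta_s2t - y_surrogate))),
    }
```

Calling `run_demo()` returns both risks for one random draw; which one is smaller depends on the spectrum, the mask, and the sample size, so a single run should not be read as confirming or refuting the paper's results. Averaging over several seeds and sweeping `keep` or `n` gives a rough view of how the gap behaves.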
