Distillation Scaling Laws

Dan Busbridge, Amitis Shidani, Floris Weers, Jason Ramapuram, Etai Littwin, Russ Webb

TL;DR

This work introduces a distillation scaling law that predicts a student model’s cross-entropy from a given compute budget and the resources allocated to teacher and student. It reveals a capacity-gap phenomenon where a stronger teacher can hinder a weaker student, and provides a broken-power-law functional form validated by large-scale transformer experiments (143M–12.6B parameters) on the C4 English dataset. The authors offer compute-optimal distillation recipes across scenarios (existing vs. to-be-trained teachers) and show distillation can outperform supervised learning under modest compute budgets or when a teacher is available, but supervised training dominates at larger data budgets or when teacher training costs are included. This framework guides practical decisions for building smaller, efficient language models with lower inference costs and reduced lifecycle compute.

Abstract

We propose a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher. Our findings mitigate the risks associated with large-scale distillation by enabling compute-optimal allocation for both the teacher and student to maximize student performance. We provide compute-optimal distillation recipes for two key scenarios: when a teacher already exists, and when a teacher needs training. In settings involving many students or an existing teacher, distillation outperforms supervised learning up to a compute level that scales predictably with student size. Conversely, if only one student is to be distilled and a teacher also requires training, supervised learning is generally preferable. Additionally, our large-scale study of distillation increases our understanding of the process and helps inform experimental design.

Paper Structure

This paper contains 115 sections, 2 theorems, 62 equations, 54 figures, 13 tables.

Key Result

Lemma 3.1

The optimal teacher $g^\star$ is given by: The teacher error $e^\star_\text{teacher}(m,T)$ is given by:

Figures (54)

  • Figure 1: Extrapolations of the Distillation Scaling Law. The distillation scaling law (\ref{eq:distillation-scaling-law}) is fitted to students with high cross-entropy ($L_S > 2.3$) for a range of teachers with cross-entropies $L_T$. Solid lines represent predicted model behavior for unseen teachers for a given student configuration (interpolation), and dashed lines represent predicted model behavior beyond seen teachers and for low cross-entropy students ($L_S \leq 2.3$). The diagonal black dashed line indicates where student and teacher cross-entropies are equal. Teachers with lower cross-entropy generally produce students with lower cross-entropy, until the capacity gap (see \ref{fig:fixedm-teacher-fixedm-students} and \ref{ssec:the-capacity-gap}). As shown, a student can also outperform its teacher (see \ref{fig:fixedm-teacher-isoflop-students}, \ref{fig:isoflop-teacher-fixedm-students}, and \ref{fig:distillation-fixedm-teacher-varydata-student}).
  • Figure 2: Fixed $\bm M$ Teacher/Student IsoFLOP profiles. Two of six teachers with a token-to-parameter ratio $M_T=D_T/N_T\approx 20$ are distilled into students across four IsoFLOP profiles defined by compute budgets $C_S\in\{3\times 10^{19},10^{20},3\times 10^{20},10^{21}\}$ FLOPs. A small number of additional distillations were also performed using $C_S=3\times 10^{21}$ FLOPs. Here, $C_S$ includes only the standard training cost of a model of size $N_S$ trained on $D_S$ tokens, i.e., the costs of teacher training and teacher inference are not included. Horizontal and vertical dashed lines indicate teacher cross-entropy $L_T$ and size $N_T$ respectively. See \ref{ssec:distillation-isoflop-profiles} and \ref{fig:fixedm-teacher-isoflop-students-app} for all six teacher profiles corresponding to $N_T\in\{546M,975M,1.82B,2.72B,4.82B,7.75B\}$.
  • Figure 3: IsoFLOP Teacher/Fixed $\bm M$ Students. (a) One of four students with a token-to-parameter ratio $M_S=D_S/N_S\approx20$ is distilled from teachers with four IsoFLOP profiles defined by compute budgets $C_T\in\{3\times10^{19},10^{20},3\times10^{20},10^{21}\}$ FLOPs. For all four student sizes $N_S\in\{546M, 975M,1.82B,7.75B\}$, see \ref{ssec:distillation-isoflop-profiles} and \ref{fig:isoflop-teacher-fixedm-students-app}. (b) All profiles are plotted against teacher cross-entropy $L_T$. Horizontal (vertical) dashed lines show student supervised cross-entropy $\widetilde{L}_S$ (student size $N_S$).
  • Figure 4: Fixed $\bm M$ Teacher/Fixed $\bm M$ Student. Students of two sizes trained with different token-to-parameter ratios $M_S=D_S/N_S\in\{20,40,80,160,320\}$ are distilled from teachers of various sizes with a token-to-parameter ratio $M_T=D_T/N_T\approx 20$. The capacity gap is visible: student cross-entropy decreases to an optimum and then increases with increasing teacher size $N_T$.
  • Figure 5: Scaling law fits. (a) The supervised scaling law (\ref{eq:supervised-scaling-law}) applied to the data in \ref{fig:supervised-fixed-long}. (b) Our distillation scaling law (\ref{eq:distillation-scaling-law}) applied to the data in \ref{fig:fixedm-teacher-isoflop-students}, \ref{fig:isoflop-teacher-fixedm-students}, and \ref{fig:fixedm-teacher-fixedm-students}. Orange points show predictions from a scaling law fitted on high cross-entropy models, for which the grey region is extrapolation. Blue points show predictions from a scaling law fitted on all data.
  • ...and 49 more figures
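
The captions above define experiments by a compute budget $C$ (an IsoFLOP profile) or by a token-to-parameter ratio $M = D/N$. The following is a minimal sketch of how the two relate. It assumes the common approximation $C \approx 6ND$ for training cost; the captions do not state which cost formula the paper uses, so the factor 6 and the function names here are illustrative assumptions, not the authors' method.

```python
# Sketch only: relates compute budget C, parameters N, and tokens D.
# ASSUMPTION: training cost C ~= 6 * N * D FLOPs (a widely used approximation,
# not confirmed by the captions). Function names are hypothetical.

def isoflop_profile(compute_budget, param_counts, flops_per_param_token=6.0):
    """For a fixed FLOP budget, return the token count each model size can afford."""
    rows = []
    for n in param_counts:
        d = compute_budget / (flops_per_param_token * n)  # tokens D at this budget
        rows.append({"params": n, "tokens": d, "tokens_per_param": d / n})
    return rows


def tokens_for_ratio(n_params, ratio=20.0):
    """Token count D for a fixed token-to-parameter ratio M = D / N."""
    return ratio * n_params


if __name__ == "__main__":
    # Student sizes and one compute budget taken from the Figure 3 caption.
    sizes = [546e6, 975e6, 1.82e9, 7.75e9]
    for row in isoflop_profile(1e21, sizes):
        print(f"N={row['params']:.3g}  D={row['tokens']:.3g}  M={row['tokens_per_param']:.3g}")
    # A fixed-M model with M ~= 20, as used for the teachers in Figure 2.
    print(f"D at M=20 for N=546M: {tokens_for_ratio(546e6):.3g}")
```

Along a single IsoFLOP profile, larger models see fewer tokens, so $M$ falls as $N$ grows; a fixed-$M$ model instead scales its token count linearly with its size.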

Theorems & Definitions (4)

  • Lemma 3.1
  • proof
  • Lemma 3.2
  • proof