Distillation Scaling Laws
Dan Busbridge, Amitis Shidani, Floris Weers, Jason Ramapuram, Etai Littwin, Russ Webb
TL;DR
This work introduces a distillation scaling law that predicts a student model’s cross-entropy from a compute budget and its allocation between teacher and student. It documents a capacity-gap phenomenon, in which a stronger teacher can hinder a weaker student, and proposes a broken-power-law functional form validated by large-scale transformer experiments (143M–12.6B parameters) on the English C4 dataset. The authors give compute-optimal distillation recipes for two scenarios: a teacher that already exists, and a teacher that must still be trained. When a teacher already exists or many students are to be distilled, distillation outperforms supervised learning up to a compute level that scales predictably with student size. At larger compute or data budgets, or when a single student must also bear the cost of training the teacher, supervised learning is generally preferable. The framework guides practical decisions for building smaller, efficient language models with lower inference costs and reduced lifecycle compute.
Abstract
We propose a distillation scaling law that estimates distilled model performance based on a compute budget and its allocation between the student and teacher. Our findings mitigate the risks associated with large-scale distillation by enabling compute-optimal allocation for both the teacher and student to maximize student performance. We provide compute-optimal distillation recipes for two key scenarios: when a teacher already exists, and when a teacher needs training. In settings involving many students or an existing teacher, distillation outperforms supervised learning up to a compute level that scales predictably with student size. Conversely, if only one student is to be distilled and a teacher also requires training, supervised learning is generally preferable. Additionally, our large-scale study of distillation increases our understanding of the process and helps inform experimental design.
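To make the two scenarios concrete, the following is a minimal sketch of how a distillation compute budget might be tallied. It assumes the common rules of thumb of roughly 6·N·D FLOPs to train an N-parameter model on D tokens and roughly 2·N·D FLOPs for a forward pass; these approximations, the function names, the example sizes, and the even amortization of teacher training across students are illustrative assumptions, not the accounting used by the authors.

```python
def train_flops(n_params, n_tokens):
    """Approximate training cost (assumed rule of thumb: 6 * N * D FLOPs)."""
    return 6.0 * n_params * n_tokens


def forward_flops(n_params, n_tokens):
    """Approximate forward-pass cost (assumed rule of thumb: 2 * N * D FLOPs)."""
    return 2.0 * n_params * n_tokens


def distillation_flops(n_student, d_student, n_teacher, d_teacher,
                       teacher_exists=True, n_students=1):
    """Hypothetical per-student compute for distillation.

    Always counted: student training plus teacher forward passes over the
    student's tokens (to produce the targets the student learns from).
    Counted only if the teacher must be trained: teacher training cost,
    split evenly across n_students (an assumption made for illustration).
    """
    student_cost = train_flops(n_student, d_student)
    teacher_logit_cost = forward_flops(n_teacher, d_student)
    if teacher_exists:
        teacher_train_cost = 0.0
    else:
        teacher_train_cost = train_flops(n_teacher, d_teacher) / n_students
    return student_cost + teacher_logit_cost + teacher_train_cost


if __name__ == "__main__":
    # Illustrative sizes only: a 1B-parameter student on 20B tokens,
    # and a 7B-parameter teacher trained on 140B tokens.
    args = (1e9, 20e9, 7e9, 140e9)
    print("existing teacher:      ", distillation_flops(*args))
    print("new teacher, 1 student:", distillation_flops(*args, teacher_exists=False))
    print("new teacher, 10 students:",
          distillation_flops(*args, teacher_exists=False, n_students=10))
```

Under these assumptions, teacher training dominates the budget when only one student is distilled, and its per-student share shrinks as more students reuse the same teacher, which is in line with the abstract's distinction between the single-student and many-student settings.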
