Towards a theory of model distillation
Enric Boix-Adsera
TL;DR
This work formalizes model distillation as PAC-distillation, a framework for analyzing when a trained, larger model can be approximated by a simpler model given samples from a data distribution. It shows that distillation can be cheaper than learning from scratch and develops a general theory of computational reductions and statistical bounds. Under a Linear Representation Hypothesis (LRH), networks can be distilled into explicit decision trees in polynomial time. Two case studies, distilling networks into juntas and into decision trees, demonstrate both algorithmic feasibility and practical benefits, and a web of reductions relates distillation across model classes. The statistical results distinguish perfect from agnostic distillation and include Pareto-frontier bounds; a stated limitation is that the sample complexity of agnostic distillation is not characterized. The paper also outlines extensions toward broader model classes and foundation models. Overall, the work lays out foundational theory and algorithms for distillation, with potential impact on interpretability, efficiency, and resource-sharing in machine learning systems.
Abstract
Distillation is the task of replacing a complicated machine learning model with a simpler model that approximates the original [BCNM06,HVD15]. Despite many practical applications, basic questions about the extent to which models can be distilled, and the runtime and amount of data needed to distill, remain largely open. To study these questions, we initiate a general theory of distillation, defining PAC-distillation in an analogous way to PAC-learning [Val84]. As applications of this theory: (1) we propose new algorithms to extract the knowledge stored in the trained weights of neural networks, showing how to efficiently distill neural networks into succinct, explicit decision tree representations when possible by using the "linear representation hypothesis"; and (2) we prove that distillation can be much cheaper than learning from scratch, and make progress on characterizing its complexity.
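
As a rough guide to what a definition "analogous to PAC-learning" might look like, the following is a sketch of a PAC-style distillation guarantee. The notation (source class $\mathcal{F}$, target class $\mathcal{G}$, distribution $\mathcal{D}$, accuracy $\epsilon$, confidence $\delta$, sample count $n$) is assumed here for illustration and does not come from the abstract; the paper's formal definition may differ in its details.

```latex
% Illustrative sketch only; notation is assumed, not quoted from the paper.
% Error of a candidate g relative to the source model f under distribution D:
\[
  \mathrm{err}_{f,\mathcal{D}}(g)
  \;=\; \Pr_{x \sim \mathcal{D}}\bigl[\, g(x) \neq f(x) \,\bigr].
\]
% A distillation algorithm receives the source model f \in \mathcal{F} itself,
% together with n i.i.d. samples x_1, \dots, x_n from \mathcal{D},
% and outputs some g \in \mathcal{G} satisfying
\[
  \Pr\bigl[\, \mathrm{err}_{f,\mathcal{D}}(g) \le \epsilon \,\bigr]
  \;\ge\; 1 - \delta .
\]
```

Under this reading, the contrast with PAC-learning is that a learner sees only labeled examples $(x_i, f(x_i))$, whereas a distiller also has access to a description of $f$ (for instance, its trained weights). That extra access is what makes it plausible for distillation to need less data or computation than learning from scratch.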
