Towards the Law of Capacity Gap in Distilling Language Models

Chen Zhang, Qiuchi Li, Dawei Song, Zheyu Ye, Yan Gao, Yan Hu

TL;DR

The paper tackles the problem that enlarging the teacher in LM distillation does not always improve the student. It proposes the law of capacity gap: a linear relation between the target student scale and its optimal teacher scale. The law is derived from small-scale pilot studies that prune and distill GPT2 and Pythia, then extrapolated to larger LLMs by distilling 7B/8B teachers into a 3B student (MiniMA), which is further finetuned into MiniChat. The key finding is that the optimal teacher scale grows linearly with the student scale, which enables compute-efficient distillation and yields a better compute-performance frontier across benchmarks, including instruction-following tasks. This reduces the need for an exhaustive teacher search, offers a practical path to high-performing compact LLMs, and introduces MiniMA and MiniChat as efficient reference points for scalable distillation. Future directions include extending the law to more architectures, data regimes, and safety-aware fine-tuning.

Abstract

Language model (LM) distillation aims at distilling the knowledge in a large teacher LM into a small student one. As a critical issue facing LM distillation, a superior student often arises from a teacher of a relatively small scale instead of a larger one, especially in the presence of a substantial capacity gap between the teacher and the student. This issue, often referred to as the curse of capacity gap, suggests that there is likely an optimal teacher yielding the best-performing student along the scaling course of the teacher. Consequently, distillation trials on teachers of a wide range of scales are called for to determine the optimal teacher, which becomes computationally intensive in the context of large LMs (LLMs). This paper addresses this critical bottleneck by providing the law of capacity gap, induced from a preliminary study on distilling a broad range of small-scale (<3B) LMs, where the optimal teacher scale consistently grows linearly with the student scale across different model and data scales. By extending the law to LLM distillation on a larger scale (7B), we succeed in obtaining versatile LLMs that outperform a wide array of competitors.

Paper Structure

This paper contains 35 sections, 2 theorems, 8 equations, 10 figures, and 13 tables.

Key Result

Proposition 1

Provided a to-be-distilled student of an expected scale, the teacher of an optimal scale can be uniquely determined through a scaling relation.
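
As a rough illustration of how such a scaling relation might be used, the sketch below fits a straight line to (student scale, best teacher scale) pairs from a pilot study and extrapolates to a new student scale. The data points, the helper names `fit_line` and `predict_teacher_scale`, and the resulting coefficients are hypothetical placeholders chosen for illustration. They are not values reported in the paper.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical pilot results, in billions of parameters (NOT from the paper):
# for each student scale, the teacher scale that gave the best student.
pilot_student_scales = [0.1, 0.2, 0.4, 0.8]
pilot_best_teacher_scales = [0.25, 0.5, 1.0, 2.0]

slope, intercept = fit_line(pilot_student_scales, pilot_best_teacher_scales)


def predict_teacher_scale(student_scale):
    """Extrapolate the fitted line to a new student scale (same units)."""
    return slope * student_scale + intercept


if __name__ == "__main__":
    target = 3.0  # hypothetical target student scale, in billions of parameters
    print(f"slope={slope:.3f}, intercept={intercept:.3f}")
    print(f"suggested teacher scale for a {target}B student: "
          f"{predict_teacher_scale(target):.2f}B")
```

The point of the sketch is the workflow: once a line is fitted on cheap small-scale trials, a single evaluation of it replaces a sweep over large teachers.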

Figures (10)

  • Figure 1: The curse of capacity gap. GPT2 (Radford et al., 2019) and Pythia (Biderman et al., 2023) are distilled with OpenWebText (Gokaslan and Cohen, 2019) and evaluated on WikiText2 (Merity et al., 2017) in perplexity. Curse: the performance of a student at a fixed scale does not improve as the teacher scale increases.
  • Figure 2: The curse of capacity gap can result in an impossible triangle in the era of LLMs. The optimal teacher scale for an expected student scale cannot be obtained with only a small compute overhead on top of the compute required by the oracle distillation.
  • Figure 3: Observations from the distillation of the GPT2 and Pythia series. Students are evaluated on the test set of WikiText2 (Merity et al., 2017) in perplexity and on the test set of LAMBADA (Paperno et al., 2016) in last-word prediction accuracy. Each colored line represents one sparsity level, and each point on it represents a student pruned and distilled from a teacher at that sparsity.
  • Figure 4: The curse of capacity gap can be turned into the law of capacity gap. Law: the optimal teacher scale exists and is linear in the student scale. Each point stands for the best teacher scale for a fixed student scale.
  • Figure 5: MiniMA yields a new compute-performance Pareto frontier, that is, MiniMA is more compute-efficient than existing LMs at any given compute budget. The radius of each circle stands for the model scale. Performance: average task measure, with each task detailed in the paper's appendix. Compute: estimated training compute in $\times 10^9$ TFLOPs, as detailed in the paper's appendix.
  • ...and 5 more figures
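
The points in Figure 4 can be read as the outcome of picking, for each student scale, the teacher whose distilled student scores best. Below is a minimal sketch of that selection step, assuming lower perplexity is better. The `trials` table and the function name `best_teacher_per_student` are hypothetical, and the numbers are invented for illustration rather than taken from the paper.

```python
# Hypothetical sweep results (NOT from the paper):
# (teacher_scale_B, student_scale_B) -> student perplexity, lower is better.
trials = {
    (0.2, 0.1): 31.0,
    (0.4, 0.1): 30.2,
    (0.8, 0.1): 30.9,  # a larger teacher does not help: the "curse"
    (0.4, 0.2): 25.5,
    (0.8, 0.2): 24.8,
    (1.6, 0.2): 25.3,
}


def best_teacher_per_student(results):
    """Return {student_scale: teacher_scale} minimizing perplexity per student."""
    best = {}  # student_scale -> (teacher_scale, perplexity)
    for (teacher, student), ppl in results.items():
        if student not in best or ppl < best[student][1]:
            best[student] = (teacher, ppl)
    return {student: teacher for student, (teacher, _) in best.items()}


if __name__ == "__main__":
    for student, teacher in sorted(best_teacher_per_student(trials).items()):
        print(f"student {student}B -> best teacher {teacher}B")
```

Running this exhaustive selection at LLM scale is exactly the cost that a fitted scaling relation is meant to avoid.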

Theorems & Definitions (3)

  • Proposition 1
  • Corollary 1
  • Remark 1