MiniPLM: Knowledge Distillation for Pre-Training Language Models

Yuxian Gu, Hao Zhou, Fandong Meng, Jie Zhou, Minlie Huang

TL;DR

MiniPLM tackles the inefficiency of pre-training knowledge distillation by introducing Difference Sampling, an offline data refinement method that leverages the discrepancy between a large teacher LM and a small reference LM to produce a harder, more diverse pre-training corpus. By decoupling reward computation from the student and performing offline teacher inference, MiniPLM enables KD across model families without extra training-time costs. Empirical results show consistent downstream gains, improved language modeling, and reduced pre-training compute, with benefits extending to data-limited settings and cross-family distillation. The approach enhances data utilization and provides a practical, scalable pathway for building high-performing small LMs under fixed compute budgets.

Abstract

Knowledge distillation (KD) is widely used to train small, high-performing student language models (LMs) using large teacher LMs. While effective in fine-tuning, KD during pre-training faces efficiency, flexibility, and effectiveness issues. Existing methods either incur high computational costs due to online teacher inference, require tokenization matching between teacher and student LMs, or risk losing the difficulty and diversity of the teacher-generated training data. In this work, we propose MiniPLM, a KD framework for pre-training LMs by refining the training data distribution with the teacher LM's knowledge. For efficiency, MiniPLM performs offline teacher inference, allowing KD for multiple student LMs without adding training costs. For flexibility, MiniPLM operates solely on the training corpus, enabling KD across model families. For effectiveness, MiniPLM leverages the differences between large and small LMs to enhance the training data difficulty and diversity, helping student LMs acquire versatile and sophisticated knowledge. Extensive experiments demonstrate that MiniPLM boosts the student LMs' performance on 9 common downstream tasks, improves language modeling capabilities, and reduces pre-training computation. The benefit of MiniPLM extends to larger training scales, evidenced by the scaling curve extrapolation. Further analysis reveals that MiniPLM supports KD across model families and enhances the pre-training data utilization. Our code, data, and models can be found at https://github.com/thu-coai/MiniPLM.
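The data-refinement idea described above can be illustrated with a minimal sketch. It assumes that per-instance log-probabilities from the teacher LM and the small reference LM have already been computed offline, scores each instance by their difference, and keeps the highest-scoring fraction. The function names, the `keep_ratio` parameter, and the hard top-fraction selection rule are illustrative assumptions, not details stated in this summary; the released code at the repository above is the authoritative implementation.

```python
def difference_scores(teacher_logprobs, reference_logprobs):
    """Score each instance by log p(x) - log p_ref(x).

    Both inputs are lists of sequence-level log-probabilities,
    assumed to be precomputed offline (hypothetical inputs).
    """
    if len(teacher_logprobs) != len(reference_logprobs):
        raise ValueError("score lists must have the same length")
    return [t - r for t, r in zip(teacher_logprobs, reference_logprobs)]


def difference_sample(corpus, teacher_logprobs, reference_logprobs, keep_ratio=0.5):
    """Keep the fraction of instances with the largest teacher/reference gap.

    keep_ratio is an illustrative hyperparameter (an assumption).
    The original corpus order is preserved among kept instances.
    """
    if len(corpus) != len(teacher_logprobs):
        raise ValueError("corpus and score lists must have the same length")
    scores = difference_scores(teacher_logprobs, reference_logprobs)
    n_keep = max(1, int(len(corpus) * keep_ratio))
    # Rank instance indices from largest to smallest score.
    ranked = sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:n_keep])
    return [corpus[i] for i in kept]


# Toy example with made-up log-probabilities.
corpus = ["easy text", "hard text", "noisy text", "medium text"]
teacher = [-10, -20, -60, -30]    # log p(x) under the teacher LM
reference = [-11, -35, -55, -38]  # log p_ref(x) under the reference LM
refined = difference_sample(corpus, teacher, reference, keep_ratio=0.5)
```

In this toy example, the instance that both models find easy gets a small score, and the instance the teacher finds less likely than the reference gets a negative score, so both are dropped. Because the selection depends only on the teacher and reference LMs, the refined corpus can be reused for any student, including one with a different tokenizer.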

Paper Structure

This paper contains 56 sections, 1 theorem, 17 equations, 11 figures, 13 tables.

Key Result

Proposition 2.1

Let $S$ be the sample space of two distributions $p_1$ and $p_2$, ${\mathbf{X}}_1, {\mathbf{X}}_2, \cdots, {\mathbf{X}}_N \sim p_1$ be $N$ i.i.d random variables, and ${\mathbf{Y}}_1, {\mathbf{Y}}_2, \cdots, {\mathbf{Y}}_{M}\sim p_2$ be $M$ i.i.d random variables. Let $r(\cdot): S \mapsto \mathbb{R}

Figures (11)

  • Figure 1: Computation (a) and model size (b) scaling curves of student LMs pre-trained from scratch with Vanilla KD and $\textsc{MiniPLM}$. The teacher LM has 1.8B parameters. "1.8B$\rightarrow$500M" means we use a 500M student LM. Training-time computation is kept constant for LMs of the same size in model scaling. The y-axis represents the LMs' zero-shot performance on 9 downstream NLP tasks.
  • Figure 2: Results of applying KD methods developed for fine-tuning to the pre-training of a 200M student LM, using a 1.8B teacher LM. See Section \ref{sec:exp_setup} for method and evaluation details. When the training FLOPs are controlled, all KD methods perform similarly to or worse than Pre-Train w/o KD.
  • Figure 3: $\textsc{MiniPLM}$. (a): Training framework. $\textsc{MiniPLM}$ distills the knowledge of the teacher LM into the student LM by adjusting the pre-training corpus of the student LM ($q_{\bm{\theta}}$) through offline Difference Sampling, based on the output probability discrepancy between the teacher LM ($p$) and a small reference LM ($p_{\text{ref}}$). (b): Illustration of the effect of Difference Sampling, which down-samples common easy instances, up-samples hard valuable instances, and removes noisy harmful instances.
  • Figure 4: Language modeling loss on the DCLM subset. We distill the knowledge of the 1.8B Qwen model into student LMs from the Qwen family with 200M, 500M, and 1.2B parameters. We control the total training-time FLOPs of different methods to be the same.
  • Figure 5: Results of KD across model families. We use the teacher and reference LM from the Qwen family to distill the Llama3.1 and Mamba models. The average zero-shot accuracies on the downstream tasks and the losses on the DCLM corpus are reported. Note that Vanilla KD and MiniLLM cannot be applied when the teacher and student LMs use different tokenizations.
  • ...and 6 more figures

Theorems & Definitions (1)

  • Proposition 2.1