LightPAFF: A Two-Stage Distillation Framework for Pre-training and Fine-tuning

Kaitao Song, Hao Sun, Xu Tan, Tao Qin, Jianfeng Lu, Hongzhi Liu, Tie-Yan Liu

TL;DR

LightPAFF is a two-stage knowledge distillation framework that compresses large pre-trained models by transferring knowledge from a big teacher to a lightweight student in both the pre-training and fine-tuning stages. The method uses a unified distillation loss across the pre-training objectives (masked language modeling, MLM; causal language modeling, CLM; masked sequence-to-sequence modeling, MSSM) and across the fine-tuning task families (language understanding, LU; language modeling, LM; sequence-to-sequence generation, SS). It reduces model size by about 5x and speeds up online inference by 5x–7x with only a modest loss in accuracy. Evaluations with BERT, GPT-2, and MASS show that LightPAFF stays close to teacher performance on tasks ranging from language understanding to machine translation. The work also analyzes unlabeled data usage, ablations, and task-difficulty factors to show when and why distilling in both stages helps pre-train-then-fine-tune pipelines.

Abstract

While pre-training and fine-tuning approaches such as BERT (Devlin et al., 2018) and GPT-2 (Radford et al., 2019) have achieved great success in language understanding and generation tasks, the pre-trained models are usually too big for online deployment in terms of both memory cost and inference speed, which hinders their practical online use. In this paper, we propose LightPAFF, a Lightweight Pre-training And Fine-tuning Framework that leverages two-stage knowledge distillation to transfer knowledge from a big teacher model to a lightweight student model in both the pre-training and fine-tuning stages. In this way, the lightweight model can achieve accuracy similar to that of the big teacher model, but with far fewer parameters and thus faster online inference. LightPAFF supports different pre-training methods (such as BERT, GPT-2 and MASS (Song et al., 2019)) and can be applied to many downstream tasks. Experiments on three language understanding tasks, three language modeling tasks and three sequence-to-sequence generation tasks demonstrate that, while achieving accuracy similar to the big BERT, GPT-2 and MASS models, LightPAFF reduces the model size by nearly 5x and improves online inference speed by 5x-7x.
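
The idea of distilling from a teacher can be illustrated with a generic per-token distillation loss. The sketch below is not taken from the paper: it assumes the common formulation that interpolates between the cross-entropy on the gold label and the cross-entropy against the teacher's output distribution. The names `softmax`, `distillation_loss`, and the weight `lam` are hypothetical. In a two-stage setup, a loss of this general shape would be applied once during pre-training and again during fine-tuning, each time with the teacher from the corresponding stage.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_probs, gold_index, lam=0.5):
    """Generic distillation loss for one prediction (illustrative only).

    student_logits: unnormalized student scores over the vocabulary/classes
    teacher_probs:  teacher's probability distribution over the same classes
    gold_index:     index of the ground-truth class
    lam:            assumed interpolation weight between the hard-label
                    term and the teacher (soft-label) term
    """
    student_probs = softmax(student_logits)
    # Hard-label term: negative log-likelihood of the gold class.
    hard = -math.log(student_probs[gold_index])
    # Soft-label term: cross-entropy between teacher and student distributions.
    soft = -sum(
        q * math.log(p)
        for q, p in zip(teacher_probs, student_probs)
        if q > 0.0
    )
    return (1.0 - lam) * hard + lam * soft


if __name__ == "__main__":
    loss = distillation_loss([2.0, 0.5, -1.0], [0.7, 0.2, 0.1], gold_index=0)
    print(f"distillation loss: {loss:.4f}")
```

Setting `lam=0` recovers ordinary supervised training on the gold labels, while `lam=1` trains the student purely to match the teacher.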

Paper Structure

This paper contains 34 sections, 7 equations, 2 figures, 11 tables.

Figures (2)

  • Figure 1: LightPAFF pipeline.
  • Figure 2: Generalization analysis. The BERT result is accuracy on the SST-2 validation set, while the GPT-2 result is perplexity on the WikiText-2 validation set.