Pre-trained Summarization Distillation

Sam Shleifer, Alexander M. Rush

TL;DR

This work systematically compares three distillation paradigms for pre-trained Seq2Seq summarization models: Shrink and Fine-Tune (SFT), Pseudo-labeling (PL), and Direct Knowledge Distillation (KD). BART and Pegasus serve as teachers, evaluated on CNN/DailyMail and XSUM. It finds that SFT delivers the best efficiency and performance on CNN/DailyMail, whereas KD and PL offer stronger gains on the more abstractive XSUM, with some configurations approaching teacher performance under substantial speedups. The study also analyzes initialization strategies, the impact of pseudo-label quality, and the effects of KD loss components, concluding that SFT should be tried first, followed by PL in many cases. Practical guidance and public code are provided to help practitioners deploy compact, fast summarization models without substantial performance penalties.

Abstract

Recent state-of-the-art approaches to summarization utilize large pre-trained Transformer models. Distilling these models to smaller student models has become critically important for practical use; however, the NLP literature proposes many different distillation methods. Recent work on distilling BERT for classification and regression tasks shows strong performance using direct knowledge distillation. Alternatively, machine translation practitioners distill using pseudo-labeling, where a small model is trained on the translations of a larger model. A third, simpler approach is to 'shrink and fine-tune' (SFT), which avoids any explicit distillation by copying parameters to a smaller student model and then fine-tuning. We compare these three approaches for distillation of Pegasus and BART, the current and former state-of-the-art pre-trained summarization models, and find that SFT outperforms knowledge distillation and pseudo-labeling on the CNN/DailyMail dataset, but under-performs pseudo-labeling on the more abstractive XSUM dataset. PyTorch code and checkpoints of different sizes are available through Hugging Face transformers at http://tiny.cc/4iy0tz.
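
The "shrink" step of SFT, as the abstract describes it, amounts to copying a subset of teacher parameters into a smaller student before fine-tuning. The following is a minimal sketch of that idea only, not the authors' implementation. The names `pick_layers_to_copy` and `shrink` are hypothetical, the evenly spaced selection rule is an assumption, and teacher layers are stood in for by plain Python objects rather than real Transformer layers.

```python
from copy import deepcopy


def pick_layers_to_copy(n_teacher: int, n_student: int) -> list:
    """Pick n_student roughly evenly spaced teacher layer indices.

    ASSUMPTION: even spacing that keeps the first and last layer is an
    illustrative choice, not a rule taken from the source text.
    """
    if not 1 <= n_student <= n_teacher:
        raise ValueError("need 1 <= n_student <= n_teacher")
    if n_student == 1:
        return [0]
    step = (n_teacher - 1) / (n_student - 1)
    return [round(i * step) for i in range(n_student)]


def shrink(teacher_layers: list, n_student: int) -> list:
    """Build a student layer stack by copying the selected teacher layers.

    Deep copies keep later fine-tuning of the student from modifying
    the teacher's parameters.
    """
    indices = pick_layers_to_copy(len(teacher_layers), n_student)
    return [deepcopy(teacher_layers[i]) for i in indices]


# Toy teacher: 12 "layers", each a dict standing in for a parameter set.
teacher = [{"layer_id": i} for i in range(12)]
student = shrink(teacher, 3)
print([layer["layer_id"] for layer in student])
```

After this step, the student would be fine-tuned on the summarization dataset as usual; no teacher outputs or distillation loss are involved, which is what distinguishes SFT from pseudo-labeling and direct knowledge distillation.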

Paper Structure

This paper contains 21 sections, 5 equations, 2 figures, 10 tables.

Figures (2)

  • Figure 1: The best distilled checkpoint from Pegasus (P) and BART (B) for XSUM and CNN at different sizes. In three out of four settings we are able to distill a student model to the same ROUGE-2 score as the teacher with at least a 90% speedup.
  • Figure 2: Training curves for different initialization strategies. Each line represents one fine-tuning run for a BART student on XSUM using a different initialization strategy. Initialization strategies are described in the initialization results table (`init_results`).