HomoDistil: Homotopic Task-Agnostic Distillation of Pre-trained Transformers

Chen Liang, Haoming Jiang, Zheng Li, Xianfeng Tang, Bin Yin, Tuo Zhao

TL;DR

The paper tackles task-agnostic distillation of large pre-trained Transformer models, where the substantial prediction gap between a high-capacity teacher and a compact student hampers knowledge transfer. It introduces HomoDistil, which initializes the student from the teacher and interleaves iterative neuron pruning with distillation so that the discrepancy stays small throughout training. Experiments on GLUE and SQuAD show that HomoDistil achieves state-of-the-art or competitive performance across several parameter budgets, with notable gains for very small models. The paper also analyzes why maintaining a low prediction discrepancy matters, the benefit of pruning iteratively, and the effect of different pruning metrics, with a view to deploying compact pre-trained language models on resource-constrained devices.

Abstract

Knowledge distillation has been shown to be a powerful model compression approach to facilitate the deployment of pre-trained language models in practice. This paper focuses on task-agnostic distillation. It produces a compact pre-trained model that can be easily fine-tuned on various tasks with small computational costs and memory footprints. Despite the practical benefits, task-agnostic distillation is challenging. Since the teacher model has a significantly larger capacity and stronger representation power than the student model, it is very difficult for the student to produce predictions that match the teacher's over a massive amount of open-domain training data. Such a large prediction discrepancy often diminishes the benefits of knowledge distillation. To address this challenge, we propose Homotopic Distillation (HomoDistil), a novel task-agnostic distillation approach equipped with iterative pruning. Specifically, we initialize the student model from the teacher model, and iteratively prune the student's neurons until the target width is reached. Such an approach maintains a small discrepancy between the teacher's and student's predictions throughout the distillation process, which ensures the effectiveness of knowledge transfer. Extensive experiments demonstrate that HomoDistil achieves significant improvements over existing baselines.
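
The procedure described in the abstract (start the student as a copy of the teacher, then prune neurons gradually while distilling) can be illustrated with a toy sketch. This is not the authors' implementation. The two-layer ReLU network, the magnitude-based importance score, the one-neuron-per-round pruning rule, and all names (`homotopic_distill`, `distill_step`, `steps_per_prune`) are assumptions chosen to keep the example small and dependency-free.

```python
import random

def matvec(W, x):
    return [sum(w * a for w, a in zip(row, x)) for row in W]

def forward(W1, W2, mask, x):
    """Toy two-layer ReLU network; mask[j] == 0 removes hidden neuron j."""
    h = [max(0.0, a) * m for a, m in zip(matvec(W1, x), mask)]
    return h, matvec(W2, h)

def discrepancy(teacher, student, mask, batch):
    """Mean squared gap between teacher and student outputs on a batch."""
    full = [1] * len(mask)  # the teacher keeps every neuron
    total = 0.0
    for x in batch:
        _, t = forward(*teacher, full, x)
        _, s = forward(*student, mask, x)
        total += sum((a - b) ** 2 for a, b in zip(s, t))
    return total / len(batch)

def distill_step(teacher, student, mask, batch, lr=0.05):
    """One gradient step on the student's output weights (MSE to the teacher)."""
    W1, W2 = student
    full = [1] * len(mask)
    grad = [[0.0] * len(mask) for _ in W2]
    for x in batch:
        _, t = forward(*teacher, full, x)
        h, s = forward(W1, W2, mask, x)
        for o in range(len(W2)):
            err = 2.0 * (s[o] - t[o]) / len(batch)
            for j in range(len(mask)):
                grad[o][j] += err * h[j]
    for o in range(len(W2)):
        for j in range(len(mask)):
            W2[o][j] -= lr * grad[o][j]

def homotopic_distill(teacher, batch, target_width, steps_per_prune=20):
    W1, W2 = teacher
    # Initialize the student from the teacher (deep copy of both layers).
    student = ([row[:] for row in W1], [row[:] for row in W2])
    mask = [1] * len(W1)
    history = [discrepancy(teacher, student, mask, batch)]  # exactly 0.0 at start
    while sum(mask) > target_width:
        active = [j for j, m in enumerate(mask) if m]
        # Assumed importance score: magnitude of a neuron's outgoing weights.
        score = lambda j: sum(abs(row[j]) for row in student[1])
        mask[min(active, key=score)] = 0  # prune the least important neuron
        for _ in range(steps_per_prune):  # distill before pruning again
            distill_step(teacher, student, mask, batch)
        history.append(discrepancy(teacher, student, mask, batch))
    return student, mask, history

random.seed(0)
teacher = ([[random.uniform(-1, 1) for _ in range(4)] for _ in range(8)],
           [[random.uniform(-1, 1) for _ in range(8)] for _ in range(2)])
batch = [[random.uniform(-1, 1) for _ in range(4)] for _ in range(32)]
student, mask, history = homotopic_distill(teacher, batch, target_width=4)
print("active neurons:", sum(mask))
print("discrepancy after each pruning round:", history)
```

Because the student starts as an exact copy, the first recorded discrepancy is zero, and each pruning round removes only one neuron before further distillation. A student that started at the target width with fresh weights would begin with a large discrepancy instead.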

Paper Structure (26 sections, 14 equations, 3 figures, 16 tables, 1 algorithm)

Figures (3)

  • Figure 1: Left: In HomoDistil, the student is initialized from the teacher and is iteratively pruned through the distillation process. The widths of rectangles represent the widths of layers. The depth of color represents the sufficiency of training. Right: An illustrative comparison of the student's optimization trajectory in HomoDistil and standard distillation. We define the region where the prediction discrepancy is sufficiently small such that the distillation is effective as the Effective Distillation Region. In HomoDistil, as the student is initialized with the teacher and is able to maintain this small discrepancy, the trajectory consistently lies in the region. In standard distillation, as the student is initialized with a much smaller capacity than the teacher's, the distillation is ineffective at the early stage of training.
  • Figure 2: The prediction discrepancy during the distillation of HomoBERT models under different schedules of sparsity.
  • Figure 3: The fine-tuning accuracy of HomoBERT-small, distilled with different schedules of sparsity, on the development sets of the GLUE benchmark.
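
Figures 2 and 3 compare different schedules of sparsity, but the captions do not say which ones. As a hedged illustration of what such a schedule looks like, the sketch below implements a generic polynomial ramp from zero sparsity to a final value. The function name `sparsity_at`, its parameters, and the choice of linear and cubic ramps are assumptions for illustration, not details taken from the paper.

```python
def sparsity_at(step, total_steps, final_sparsity, power=3):
    """Fraction of neurons pruned at `step` under a polynomial ramp.

    power=1 gives a linear ramp; power=3 gives a cubic ramp that prunes
    quickly at first and slows down as the target is approached.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return final_sparsity * (1.0 - (1.0 - progress) ** power)

# Compare a linear and a cubic ramp toward 60% sparsity over 100 steps.
for step in (0, 25, 50, 75, 100):
    linear = sparsity_at(step, 100, 0.6, power=1)
    cubic = sparsity_at(step, 100, 0.6, power=3)
    print(f"step {step:3d}: linear={linear:.3f} cubic={cubic:.3f}")
```

Both ramps start at zero sparsity, so the student begins identical to the teacher, and both end at the same target. They differ only in how fast neurons are removed along the way, which is the kind of difference the two figures examine.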