Weight-Inherited Distillation for Task-Agnostic BERT Compression

Taiqiang Wu, Cheng Hou, Shanshan Lao, Jiayi Li, Ngai Wong, Zhe Zhao, Yujiu Yang

TL;DR

Weight-Inherited Distillation (WID) transfers knowledge directly from the teacher: it trains a compact student by inheriting the teacher's weights, offering a new perspective on knowledge distillation.

Abstract

Knowledge Distillation (KD) is a predominant approach for BERT compression. Previous KD-based methods focus on designing extra alignment losses so that the student model mimics the behavior of the teacher model. These methods transfer knowledge only indirectly. In this paper, we propose a novel Weight-Inherited Distillation (WID), which transfers knowledge directly from the teacher. WID does not require any additional alignment loss and trains a compact student by inheriting the weights, offering a new perspective on knowledge distillation. Specifically, we design row compactors and column compactors as mappings and then compress the weights via structural re-parameterization. Experimental results on the GLUE and SQuAD benchmarks show that WID outperforms previous state-of-the-art KD-based baselines. Further analysis indicates that WID can also learn the attention patterns of the teacher model without any alignment loss on attention distributions. The code is available at https://github.com/wutaiqiang/WID-NAACL2024.
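
The re-parameterization idea in the abstract can be illustrated with a small sketch. It assumes the compactors act as plain matrix factors around the teacher weight, so that the student weight is the product of row compactor, teacher weight, and column compactor. That convention, the helper names (`matmul`, `identity`, `merge`), and all numeric values are assumptions made for illustration; they are not taken from the authors' code.

```python
# Sketch only: the convention W_S = R @ W_T @ C and every name below are
# illustrative assumptions, not the authors' implementation.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def identity(n):
    """n x n identity matrix, the initial value of a compactor."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def merge(row_compactor, weight, col_compactor):
    """Fold both compactors into the layer weight: R @ W @ C."""
    return matmul(matmul(row_compactor, weight), col_compactor)

# Teacher weight with B = 3 rows and C = 4 columns.
teacher_weight = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
]

# Identity-initialized compactors leave the teacher weight unchanged.
unchanged = merge(identity(3), teacher_weight, identity(4))

# Hypothetical compactors after training and compression:
# D = 2 rows survive in the row compactor, E = 2 columns in the column compactor.
row_compactor = [
    [1.0, 0.0, 0.5],
    [0.0, 1.0, 0.5],
]  # shape D x B
col_compactor = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.5, 0.0],
    [0.0, 0.5],
]  # shape C x E

# The merged student weight has shape D x E.
student_weight = merge(row_compactor, teacher_weight, col_compactor)
```

Because the merged matrix is an ordinary weight, the student layer in this sketch needs no extra parameters or operations at inference time.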

Paper Structure (40 sections, 15 equations, 8 figures, 7 tables, 1 algorithm)

Figures (8)

  • Figure 1: Overview of compressing a linear layer $L_T$ with weight $\mathbf{W}^{L_T} \in \mathbb{R}^{B \times C}$ into a compact linear layer $L_S$ with weight $\mathbf{W}^{L_S} \in \mathbb{R}^{D \times E}$ via WID. Both the row compactor and the column compactor are initialized as identity matrices. After training, we compress the compactors and merge them with the original layer. All the linear layers in the teacher model are compressed simultaneously.
  • Figure 2: Training and compression for the column compactor. During training, we add weight penalty gradients column by column and progressively select the mask used to fuse the penalty gradients with the original loss gradients. After training, we compress the column compactor according to the column mask.
  • Figure 3: Compactor merging process for a Transformer block. We merge the bias terms with the corresponding column compactors. For beta and gamma in Layer Norm (LN), we use the preceding column compactors to update them. During training, compactors shown in the same color are aligned. For each group of aligned compactors, we learn one of them and duplicate (or flip) it for the remaining compactors.
  • Figure 4: Attention distributions under the same input tokens for $\text{BERT}_{\text{base}}$ (upper), $\text{WID}_{\text{11}}^{dim}$ (middle), and $\text{BERT}_{\text{11}}$ (bottom). WID learns the teacher's attention distributions without any alignment loss.
  • Figure 5: The training process for teacher GPT, vanilla student GPT, and students via KD and WID.
  • ...and 3 more figures
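
The column-compactor procedure described in the Figure 2 caption can be sketched as follows. The captions do not spell out the selection rule or the exact fusion, so this sketch assumes that the columns with the smallest Euclidean norm are masked, that masked columns receive only the penalty gradient, and that the penalty gradient is the column divided by its norm. These choices, `penalty_strength`, and all function names are hypothetical.

```python
import math

# Sketch only: the norm-based selection rule, `penalty_strength`, and the way
# the two gradients are fused are assumptions made for illustration.

def column_norms(matrix):
    """Euclidean norm of every column of a matrix stored as a list of rows."""
    return [math.sqrt(sum(v * v for v in col)) for col in zip(*matrix)]

def select_mask(matrix, num_masked):
    """Mask (0) the `num_masked` columns with the smallest norm, keep (1) the rest."""
    norms = column_norms(matrix)
    order = sorted(range(len(norms)), key=lambda j: norms[j])
    masked = set(order[:num_masked])
    return [0 if j in masked else 1 for j in range(len(norms))]

def fuse_gradients(matrix, loss_grad, mask, penalty_strength):
    """Kept columns follow the loss gradient; every column also receives a
    penalty gradient (column / ||column||) that pulls it toward zero."""
    norms = column_norms(matrix)
    fused = []
    for w_row, g_row in zip(matrix, loss_grad):
        fused.append([
            mask[j] * g + penalty_strength * w / max(norms[j], 1e-12)
            for j, (w, g) in enumerate(zip(w_row, g_row))
        ])
    return fused

def compress_columns(matrix, mask):
    """Drop the masked columns of the compactor."""
    return [[v for v, keep in zip(row, mask) if keep] for row in matrix]

# Hypothetical column compactor partway through training (3 x 3).
compactor = [
    [1.0, 0.0, 0.0],
    [0.0, 0.1, 0.0],
    [0.0, 0.0, 0.9],
]
loss_grad = [[1.0, 1.0, 1.0] for _ in range(3)]

mask = select_mask(compactor, num_masked=1)
fused = fuse_gradients(compactor, loss_grad, mask, penalty_strength=0.5)
compressed = compress_columns(compactor, mask)  # shape 3 x 2
```

In a progressive schedule, `num_masked` would grow over training until the target width is reached; the schedule itself is not specified in the captions and is left out of the sketch.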