MixKD: Towards Efficient Distillation of Large-scale Language Models

Kevin J Liang, Weituo Hao, Dinghan Shen, Yufan Zhou, Weizhu Chen, Changyou Chen, Lawrence Carin

TL;DR

MixKD is a data-agnostic distillation framework that leverages mixup, a simple yet efficient data augmentation approach, to endow the distilled model with stronger generalization ability. The authors also prove that, under reasonable conditions, MixKD gives rise to a smaller gap between the generalization error and the empirical error.

Abstract

Large-scale language models have recently demonstrated impressive empirical performance. Nevertheless, the improved results are attained at the price of bigger models, more power consumption, and slower inference, which hinder their applicability to low-resource (both memory and computation) platforms. Knowledge distillation (KD) has been demonstrated as an effective framework for compressing such big models. However, large-scale neural network systems are prone to memorize training instances, and thus tend to make inconsistent predictions when the data distribution is altered slightly. Moreover, the student model has few opportunities to request useful information from the teacher model when there is limited task-specific data available. To address these issues, we propose MixKD, a data-agnostic distillation framework that leverages mixup, a simple yet efficient data augmentation approach, to endow the resulting model with stronger generalization ability. Concretely, in addition to the original training examples, the student model is encouraged to mimic the teacher's behavior on the linear interpolation of example pairs as well. We prove from a theoretical perspective that under reasonable conditions MixKD gives rise to a smaller gap between the generalization error and the empirical error. To verify its effectiveness, we conduct experiments on the GLUE benchmark, where MixKD consistently leads to significant gains over the standard KD training, and outperforms several competitive baselines. Experiments under a limited-data setting and ablation studies further demonstrate the advantages of the proposed approach.
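
The core idea stated in the abstract, having the student mimic the teacher on linear interpolations of example pairs as well as on the original examples, can be illustrated with a minimal sketch. This is not the authors' implementation: the tiny linear "models", the function names (`mixup`, `linear_model`, `mixkd_style_loss`), the equal weighting of the three terms, and the use of plain feature vectors as inputs are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixup(x_i, x_j, lam):
    """Linear interpolation of two equal-length feature vectors."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]


def cross_entropy(target_probs, pred_probs, eps=1e-12):
    """Cross-entropy between teacher (target) and student (pred) distributions."""
    return -sum(t * math.log(p + eps) for t, p in zip(target_probs, pred_probs))


def linear_model(weights):
    """Hypothetical stand-in for a network: logits = W x."""
    def forward(x):
        return [sum(w * v for w, v in zip(row, x)) for row in weights]
    return forward


def mixkd_style_loss(teacher, student, x_i, x_j, lam):
    """Average distillation loss over two original examples and their mixture.

    Assumption: the three terms are weighted equally; the paper's actual
    objective and weighting are not given in this summary.
    """
    x_mix = mixup(x_i, x_j, lam)
    total = 0.0
    for x in (x_i, x_j, x_mix):
        teacher_probs = softmax(teacher(x))   # soft labels from the teacher
        student_probs = softmax(student(x))   # student predictions
        total += cross_entropy(teacher_probs, student_probs)
    return total / 3.0


# Illustrative usage with made-up weights and inputs.
teacher = linear_model([[2.0, -1.0], [-1.0, 2.0]])
student = linear_model([[0.5, 0.0], [0.0, 0.5]])
loss = mixkd_style_loss(teacher, student, [1.0, 0.0], [0.0, 1.0], lam=0.3)
```

Because the teacher can label any interpolated input, the mixed example needs no ground-truth label of its own, which is one way to read the abstract's point about giving the student more opportunities to query the teacher when task-specific data is limited.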

Paper Structure

This paper contains 19 sections, 5 theorems, 47 equations, 3 figures, 5 tables.

Key Result

Theorem 1

Assume the loss function $l(\cdot, \cdot)$ is upper bounded by $M>0$. Under Case 1, there exists a constant $c>0$ such that if then where $\epsilon^*$ and $\epsilon_p$ denote the minimal generalization gaps one can achieve with or without augmented data, with at least $1-\delta$ probability. If further assuming a better empirical risk with data augmentation (which is usually the case in practice

Figures (3)

  • Figure 1: Results of limited data case, where both the teacher and student models are learned with only 10% (left) or 1% of the training data (right).
  • Figure 2: Latent space of randomly sampled training data and their mixup neighbours encoded by student model (a) learned by standard fine-tuning (b) learned with MixKD.
  • Figure 3: Hyperparameter sensitivity analysis regarding the MixKD approach, with different choices of $\alpha_{\text{TMKD}}, \alpha_{\text{SM}}$ and the ratio of mixup samples (w.r.t. the original training data).

Theorems & Definitions (6)

  • Theorem 1
  • Theorem 2
  • Theorem 3
  • Remark 4
  • Theorem 5
  • Theorem 6