Knowledge Distillation from Internal Representations

Gustavo Aguilar, Yuan Ling, Yu Zhang, Benjamin Yao, Xing Fan, Chenlei Guo

TL;DR

Standard knowledge distillation transfers only the teacher's output probabilities, which may fail to convey the internal linguistic abstractions of large transformers. The authors propose distilling internal representations by matching self-attention distributions and CLS activations, using bottom-up (PID) and stacked (SID) strategies to compress teacher knowledge into a smaller student. Across GLUE tasks, internal representation distillation consistently outperforms soft-label KD and reaches near-teacher performance with roughly half the parameters. The approach transfers linguistic structure more faithfully and yields transformer models that are easier to deploy in resource-constrained settings.

Abstract

Knowledge distillation is typically conducted by training a small model (the student) to mimic a large and cumbersome model (the teacher). The idea is to compress the knowledge from the teacher by using its output probabilities as soft-labels to optimize the student. However, when the teacher is considerably large, there is no guarantee that the internal knowledge of the teacher will be transferred into the student; even if the student closely matches the soft-labels, its internal representations may be considerably different. This internal mismatch can undermine the generalization capabilities originally intended to be transferred from the teacher to the student. In this paper, we propose to distill the internal representations of a large model such as BERT into a simplified version of it. We formulate two ways to distill such representations and various algorithms to conduct the distillation. We experiment with datasets from the GLUE benchmark and consistently show that adding knowledge distillation from internal representations is a more powerful method than only using soft-label distillation.
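
The difference between soft-label distillation and internal distillation can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the temperature value, and the use of a cosine distance for the CLS activations are assumptions made for illustration. The soft-label term is written as a KL divergence, which differs from cross-entropy against the soft-labels only by the teacher's entropy, a constant with respect to the student.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn logits into a probability distribution (numerically stable)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def soft_label_loss(teacher_logits, student_logits, temperature=2.0):
    """Standard KD: the student matches the teacher's output probabilities.
    The temperature of 2.0 is an illustrative assumption."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return kl_divergence(p_teacher, p_student)

def attention_loss(teacher_attn, student_attn):
    """Internal KD, attention part: mean KL over the rows of one head.
    Each row is the attention distribution of one query token."""
    rows = [kl_divergence(t, s) for t, s in zip(teacher_attn, student_attn)]
    return sum(rows) / len(rows)

def cls_loss(teacher_cls, student_cls):
    """Internal KD, CLS part: cosine distance between activation vectors.
    Cosine distance is an assumption; the summary only says the CLS
    activations are matched."""
    dot = sum(t * s for t, s in zip(teacher_cls, student_cls))
    norm_t = math.sqrt(sum(t * t for t in teacher_cls))
    norm_s = math.sqrt(sum(s * s for s in student_cls))
    return 1.0 - dot / (norm_t * norm_s)

# A student can match the teacher's outputs while attending differently.
teacher_attn = [[0.9, 0.1], [0.2, 0.8]]
student_attn = [[0.5, 0.5], [0.5, 0.5]]
output_gap = soft_label_loss([2.0, 0.0], [2.0, 0.0])   # identical outputs
internal_gap = attention_loss(teacher_attn, student_attn)
```

In this toy case `output_gap` is zero while `internal_gap` is positive, which is the mismatch the abstract describes: the soft-label loss alone gives no signal about how the student's internal representations differ from the teacher's.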

Paper Structure

This paper contains 18 sections, 3 equations, 5 figures, 4 tables, 1 algorithm.

Figures (5)

  • Figure 1: Knowledge distillation from internal representations. We show the internal layers that the teacher (left) distills into the student (right).
  • Figure 2: Performance vs. parameters trade-off. The points along the lines denote the number of layers used in BERT, which is reflected by the number of parameters in the x-axis.
  • Figure 3: The impact of training size for standard vs. internal KD. We experiment with sizes between 1K and +350K.
  • Figure 4: Comparing algorithm convergences across epochs. The annotations along the lines denote the layers that have been completely optimized. After the L6 point, only the classification layer is trained.
  • Figure 5: Attention comparison for head 8 in layer 5, each student with its corresponding head KL-divergence loss. The KL-divergence loss for the given example across all matching layers between the students and the teacher is 2.229 and 0.085 for the standard KD and internal KD students, respectively.