Embedding Compression for Teacher-to-Student Knowledge Transfer

Yiwei Ding, Alexander Lerch

TL;DR

This work addresses the challenge of transferring knowledge via teacher embeddings when source and target tasks differ, by introducing an embedding compression module that applies a trainable transformation to produce compact, task-relevant teacher representations. A distance-based distillation objective aligns the student’s features with these compact embeddings, while keeping the teacher parameters fixed; the loss $L_\text{teacher}$ updates only the transformation. Evaluations across music auto-tagging and related tasks show that embedding compression yields performance gains, especially for unsupervised teacher embeddings, and enhances generalizability to out-of-domain data, all with manageable complexity. The approach broadens the applicability of embedding-based knowledge transfer beyond closely related tasks and demonstrates practical benefits for resource-constrained settings.

Abstract

Common knowledge distillation methods require the teacher model and the student model to be trained on the same task. However, the usage of embeddings as teachers has also been proposed for different source tasks and target tasks. Prior work that uses embeddings as teachers ignores the fact that the teacher embeddings are likely to contain irrelevant knowledge for the target task. To address this problem, we propose to use an embedding compression module with a trainable teacher transformation to obtain a compact teacher embedding. Results show that adding the embedding compression module improves the classification performance, especially for unsupervised teacher embeddings. Moreover, student models trained with the guidance of embeddings show stronger generalizability.
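
The idea of compressing a frozen teacher embedding and pulling student features toward the compact version can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the linear map `W`, the dimensions, the squared-Euclidean distance, and the plain gradient steps on stand-in student features are all assumptions chosen for illustration. The loss that trains the transformation is not modeled here, so `W` simply stays at its random initial value.

```python
import numpy as np

rng = np.random.default_rng(0)

def compress(teacher_emb, W):
    """Map frozen teacher embeddings to a compact space (assumed linear)."""
    return teacher_emb @ W

def distance_loss(student_feat, compact_emb):
    """Mean squared Euclidean distance between student and compact features."""
    return float(np.mean(np.sum((student_feat - compact_emb) ** 2, axis=1)))

def distance_grad_student(student_feat, compact_emb):
    """Gradient of distance_loss w.r.t. the student features only.

    The compact teacher embedding is treated as a constant target, so this
    loss does not update the teacher or the transformation.
    """
    return 2.0 * (student_feat - compact_emb) / student_feat.shape[0]

# Hypothetical sizes: 8 clips, 512-dim teacher embedding, 32-dim compact space.
batch, teacher_dim, compact_dim = 8, 512, 32
teacher_emb = rng.normal(size=(batch, teacher_dim))           # precomputed, frozen
W = rng.normal(scale=0.05, size=(teacher_dim, compact_dim))   # trainable in practice
student_feat = rng.normal(size=(batch, compact_dim))          # stand-in for student output

compact = compress(teacher_emb, W)
before = distance_loss(student_feat, compact)

# A few gradient steps move the student features toward the compact target.
for _ in range(50):
    student_feat -= 0.1 * distance_grad_student(student_feat, compact)

after = distance_loss(student_feat, compact)
print(f"distance loss before: {before:.3f}, after: {after:.3f}")
```

In a real training setup the student features would come from a network whose weights receive this gradient, and the distance term would be combined with the target-task classification loss.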

Paper Structure (14 sections, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Illustration of irrelevant knowledge in teacher embeddings, which might make the knowledge transfer from the embeddings to the student models problematic. It is caused by the dissimilarity between the source task and the target task.
  • Figure 2: Different pipelines in training. The black arrows indicate the forward path and the orange arrows show the gradient flow in back propagation.
  • Figure 3: Results on MagnaTagATune dataset with (a) SC-CNN and (b) HCNN. Better viewed in color. The red dashed line is the baseline result. The gray bars are the results of TeacherLR. The orange bars and green bars are FitNet and distance correlation respectively. Slashed bars are those with embedding compression.