Learning to Project for Cross-Task Knowledge Distillation

Dylan Auty, Roy Miles, Benedikt Kolbeinsson, Krystian Mikolajczyk

TL;DR

This method can obtain significant performance improvements when using even a randomly-initialised teacher on various tasks such as depth estimation, image translation, and semantic segmentation, despite the lack of any learned knowledge to transfer.

Abstract

Traditional knowledge distillation (KD) relies on a proficient teacher trained on the target task, which is not always available. In this setting, cross-task distillation can be used, enabling the use of any teacher model trained on a different task. However, many KD methods prove ineffective when applied to this cross-task setting. To address this limitation, we propose a simple modification: the use of an inverted projection. We show that this drop-in replacement for a standard projector is effective by learning to disregard any task-specific features which might degrade the student's performance. We find that this simple modification is sufficient for extending many KD methods to the cross-task setting, where the teacher and student tasks can be very different. In doing so, we obtain up to a 1.9% improvement in the cross-task setting compared to the traditional projection, at no additional cost. Our method can obtain significant performance improvements (up to 7%) when using even a randomly-initialised teacher on various tasks such as depth estimation, image translation, and semantic segmentation, despite the lack of any learned knowledge to transfer. To provide conceptual and analytical insights into this result, we show that using an inverted projection allows the distillation loss to be decomposed into a knowledge transfer and a spectral regularisation component. Through this analysis we are additionally able to propose a novel regularisation loss that allows teacher-free distillation, enabling performance improvements of up to 8.57% on ImageNet with no additional training costs.
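The difference between the traditional and the inverted projection can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function names, the choice of a single linear projection matrix, the mean-squared-error form of the loss, and the feature dimensions are all assumptions made for the example. The one detail taken from the source is the direction of the projection: the traditional projector maps student features into the teacher's space, while the inverted projector maps teacher features into the student's space.

```python
import numpy as np


def traditional_projection_loss(student_feats, teacher_feats, proj):
    """Project STUDENT features into the teacher's space, then compare.

    student_feats: (n, d_s), teacher_feats: (n, d_t), proj: (d_s, d_t).
    The MSE form is an assumption made for this sketch.
    """
    return float(np.mean((student_feats @ proj - teacher_feats) ** 2))


def inverted_projection_loss(student_feats, teacher_feats, proj):
    """Project TEACHER features into the student's space, then compare.

    student_feats: (n, d_s), teacher_feats: (n, d_t), proj: (d_t, d_s).
    A low-rank `proj` can drop teacher feature directions entirely.
    """
    return float(np.mean((student_feats - teacher_feats @ proj) ** 2))


# Hypothetical feature shapes, chosen only for illustration.
rng = np.random.default_rng(0)
n, d_s, d_t = 8, 16, 32
z_s = rng.normal(size=(n, d_s))  # student features
z_t = rng.normal(size=(n, d_t))  # frozen teacher features

p_trad = 0.1 * rng.normal(size=(d_s, d_t))  # student -> teacher space
p_inv = 0.1 * rng.normal(size=(d_t, d_s))   # teacher -> student space

loss_trad = traditional_projection_loss(z_s, z_t, p_trad)
loss_inv = inverted_projection_loss(z_s, z_t, p_inv)
print(f"traditional: {loss_trad:.4f}, inverted: {loss_inv:.4f}")
```

In training, either term would be added to the supervised task loss, and the projection matrix would be learned jointly with the student while the teacher stays frozen.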

Paper Structure

This paper contains 15 sections, 4 equations, 3 figures, and 6 tables.

Figures (3)

  • Figure 1: Our cross-task knowledge distillation pipeline, where a student model is trained on a target task with the aid of a frozen teacher that is pretrained on a different task. Compared to standard same-task feature distillation, our cross-task approach uses an inverted projector, which is able to discard irrelevant task-specific features from the different-task teacher. The loss comprises a feature distillation loss $\mathcal{L}_{distill}$ that matches the student features with the projected teacher features, and a task-specific supervised loss $\mathcal{L}_{task}$ applied only to the student model's output for the target task.
  • Figure 2: Qualitative results on NYUv2 (depth) using different teacher tasks: results when distilling from depth estimation, instance segmentation, and randomly-initialised teachers to a MobileNetV2 student. In each case, we use the optimal projection type for the teacher task.
  • Figure 3: Evolution of singular values of the projection matrix $\mathbf{P}$ under different cross-task settings and projector types. The green area highlights the rank of $\mathbf{P}$. The projection tends towards a higher rank either when using the traditional projection or when using a same-task or similar-task teacher. The low rank obtained when using our inverted projection in the cross-task setting allows irrelevant features to be filtered out, if necessary for the task pair. Top row: traditional projection. Bottom row: our inverted projection.
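
The rank behaviour described in the Figure 3 caption can be inspected with a singular value decomposition of the projection matrix. The snippet below is a hedged sketch rather than the procedure used in the paper: the helper name `numerical_rank`, the relative threshold `rel_tol`, and the synthetic matrices are assumptions introduced for illustration, and how the paper measures rank is not stated in this excerpt.

```python
import numpy as np


def numerical_rank(proj, rel_tol=1e-3):
    """Count singular values above rel_tol times the largest one.

    The relative threshold is an assumption made for this sketch.
    """
    s = np.linalg.svd(proj, compute_uv=False)  # sorted in descending order
    if s.size == 0 or s[0] == 0.0:
        return 0
    return int(np.sum(s > rel_tol * s[0]))


# Synthetic stand-ins for a learned projection matrix P (teacher -> student).
rng = np.random.default_rng(0)
d_t, d_s, k = 32, 16, 4

# A rank-k matrix: only k directions of the teacher features pass through.
low_rank_p = rng.normal(size=(d_t, k)) @ rng.normal(size=(k, d_s))
# A generic dense matrix: all d_s directions are used.
full_rank_p = rng.normal(size=(d_t, d_s))

rank_low = numerical_rank(low_rank_p)
rank_full = numerical_rank(full_rank_p)
print(f"low-rank P: {rank_low}, dense P: {rank_full}")
```

Logging this quantity for the learned projector at regular intervals during training would give a curve comparable in spirit to the singular value evolution shown in the figure.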