$V_kD$: Improving Knowledge Distillation using Orthogonal Projections

Roy Miles, Ismail Elezi, Jiankang Deng

TL;DR

A novel constrained feature distillation method derived from a small set of core principles, which yield two components: an orthogonal projection and a task-specific normalisation. Equipped with both, transformer models can outperform all previous methods on ImageNet and reach up to a 4.4% relative improvement over the previous state-of-the-art methods.

Abstract

Knowledge distillation is an effective method for training small and efficient deep learning models. However, the efficacy of a single method can degenerate when transferring to other tasks, modalities, or even other architectures. To address this limitation, we propose a novel constrained feature distillation method. This method is derived from a small set of core principles, which results in two emerging components: an orthogonal projection and a task-specific normalisation. Equipped with both of these components, our transformer models can outperform all previous methods on ImageNet and reach up to a 4.4% relative improvement over the previous state-of-the-art methods. To further demonstrate the generality of our method, we apply it to object detection and image generation, whereby we obtain consistent and substantial performance improvements over state-of-the-art. Code and models are publicly available: https://github.com/roymiles/vkd
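
The two components named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (see the linked repository for that): the function names, the feature dimensions (192 and 384), the use of per-dimension standardisation as the normalisation, and the fixed random projection built with a QR decomposition are all assumptions made for illustration. In the actual method the projection is trained while being constrained to stay orthogonal.

```python
import numpy as np


def orthonormal_projection(d_student, d_teacher, rng):
    """Return a (d_student, d_teacher) matrix with orthonormal rows.

    Hypothetical helper: assumes d_student <= d_teacher and uses a fixed
    random matrix, whereas a real projector would be learned.
    """
    a = rng.standard_normal((d_teacher, d_student))
    q, _ = np.linalg.qr(a)  # q has shape (d_teacher, d_student), orthonormal columns
    return q.T


def standardise(x, eps=1e-6):
    """Zero-mean, unit-variance per feature dimension (an assumed normalisation)."""
    mean = x.mean(axis=0, keepdims=True)
    std = x.std(axis=0, keepdims=True)
    return (x - mean) / (std + eps)


def distill_loss(student_feats, teacher_feats, proj):
    """Mean squared error between projected student and normalised teacher features."""
    projected = student_feats @ proj      # (batch, d_teacher)
    target = standardise(teacher_feats)   # (batch, d_teacher)
    return float(np.mean((projected - target) ** 2))


rng = np.random.default_rng(0)
student = rng.standard_normal((8, 192))   # stand-in for student backbone features
teacher = rng.standard_normal((8, 384))   # stand-in for frozen teacher features
P = orthonormal_projection(192, 384, rng)
loss = distill_loss(student, teacher, P)
print(f"distillation loss: {loss:.4f}")
```

Because the rows of `P` are orthonormal, `P @ P.T` is the identity, so the projection cannot rescale or mix student features in a way that hides information from the loss.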

Paper Structure (38 sections, 12 equations, 11 figures, 9 tables)

Figures (11)

  • Figure 1: Comparison to both DeiT and distilled DeiT (Touvron et al., 2021) on ImageNet-1K, where the distilled variant is a DeiT model trained using a distillation token. Our proposed distillation method achieves significant improvements over DeiT-Ti, while effectively bridging the gap between the teacher and student performance for DeiT-S.
  • Figure 2: Illustration of our proposed feature distillation using an orthonormal projection and task-specific feature normalisation. The orthonormal projection (a) maximises the knowledge being distilled to the student backbone, while the task-specific normalisation (b) can introduce domain-specific priors to improve model performance. Icons in the figure distinguish trainable weights from frozen weights.
  • Figure 3: t-SNE visualisation (van der Maaten and Hinton, 2008) of features undergoing either a linear or orthogonal transformation. The orthogonal transformation preserves all of the structural feature information, whereas the linear projection can distort much of this structure, which can diminish the efficacy of distillation.
  • Figure 4: Visualisation of the $V_k$D-Ti $\mathcal{L}_{distill}$ loss landscape with perturbations of the input image across two random dimensions. Normalisation significantly reduces the sensitivity of the loss to random perturbations, which leads to improved robustness and convergence for training.
  • Figure 5: Comparing the performance and convergence of various projector reparameterisations. Although the MLP layer initially trains fast, it begins to saturate as it starts to learn a new representation of the data.
  • ...and 6 more figures
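
The structure-preservation claim in the Figure 3 caption can be checked numerically. The sketch below is an illustration under assumed settings, not a reproduction of the figure: the feature sizes (32 and 64), the random data, and the use of the Gram matrix of pairwise inner products as a proxy for "structure" are all choices made here.

```python
import numpy as np

rng = np.random.default_rng(0)
feats = rng.standard_normal((16, 32))   # 16 hypothetical feature vectors of size 32

# Projection with orthonormal rows, mapping 32 -> 64 dimensions.
q, _ = np.linalg.qr(rng.standard_normal((64, 32)))
orth = q.T                              # shape (32, 64), orth @ orth.T == I

# Unconstrained linear projection of the same shape.
lin = rng.standard_normal((32, 64))


def gram(x):
    """Pairwise inner products between the rows of x."""
    return x @ x.T


# Largest change in any pairwise inner product after each projection.
gram_error_orth = np.abs(gram(feats @ orth) - gram(feats)).max()
gram_error_lin = np.abs(gram(feats @ lin) - gram(feats)).max()

print(f"orthogonal projection: {gram_error_orth:.2e}")
print(f"linear projection:     {gram_error_lin:.2e}")
```

The orthogonal projection leaves every pairwise inner product unchanged up to floating-point error, and therefore also every pairwise distance, while the unconstrained linear map changes them substantially.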