$V_kD:$ Improving Knowledge Distillation using Orthogonal Projections

Roy Miles; Ismail Elezi; Jiankang Deng

TL;DR

A novel constrained feature distillation method derived from a small set of core principles, which yield two components: an orthogonal projection and a task-specific normalisation. Equipped with both, transformer models outperform all previous methods on ImageNet and reach up to a 4.4% relative improvement over the previous state-of-the-art methods.

Abstract

Knowledge distillation is an effective method for training small and efficient deep learning models. However, the efficacy of a single method can degenerate when transferring to other tasks, modalities, or even other architectures. To address this limitation, we propose a novel constrained feature distillation method. This method is derived from a small set of core principles, which results in two emerging components: an orthogonal projection and a task-specific normalisation. Equipped with both of these components, our transformer models can outperform all previous methods on ImageNet and reach up to a 4.4% relative improvement over the previous state-of-the-art methods. To further demonstrate the generality of our method, we apply it to object detection and image generation, whereby we obtain consistent and substantial performance improvements over state-of-the-art. Code and models are publicly available: https://github.com/roymiles/vkd
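
As a rough illustration of the two components named in the abstract, the following minimal sketch projects student features with a matrix that has orthonormal columns and compares them against standardised teacher features. It is not the authors' implementation: the QR-based construction, the mean-squared-error loss, the per-feature standardisation, and all function names and dimensions are assumptions made for illustration only.

```python
import numpy as np


def random_orthonormal(d_in, d_out, seed=0):
    """Return a (d_in, d_out) matrix with orthonormal columns (needs d_in >= d_out).

    Assumption: QR of a random Gaussian matrix is used here as a simple
    stand-in for a learned orthogonal reparameterisation.
    """
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((d_in, d_out)))
    return q


def standardise(z, eps=1e-6):
    """Zero-mean, unit-variance normalisation of each feature over the batch."""
    return (z - z.mean(axis=0)) / (z.std(axis=0) + eps)


def distill_loss(z_student, z_teacher, proj):
    """Mean squared error between projected student and standardised teacher features."""
    return float(np.mean((z_student @ proj - standardise(z_teacher)) ** 2))


# Hypothetical batch of 8 samples: 16-d student features, 12-d teacher features.
rng = np.random.default_rng(1)
z_s = rng.standard_normal((8, 16))
z_t = rng.standard_normal((8, 12))

P = random_orthonormal(16, 12)
loss = distill_loss(z_s, z_t, P)

# The projection has orthonormal columns: P^T P = I.
gram_ok = np.allclose(P.T @ P, np.eye(12))

# A square orthogonal matrix preserves the norm of every feature vector.
Q = random_orthonormal(16, 16)
norms_ok = np.allclose(np.linalg.norm(z_s @ Q, axis=1), np.linalg.norm(z_s, axis=1))
```

In this sketch `gram_ok` checks the orthonormality constraint and `norms_ok` checks that a square orthogonal map leaves feature norms unchanged, which is the sense in which such a transformation preserves feature structure.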

Paper Structure (38 sections, 12 equations, 11 figures, 9 tables)

Introduction
Related work
Orthogonal Projections
Why use orthogonal projections?
Orthogonal reparameterisation
Orthogonal projections minimise redundancy
Introducing domain-specific priors
Standardisation improves model convergence
Whitening improves feature diversity
Experiments
Implementation details
Data efficient training of transformers
Object detection
Data limited image generation
Ablation study
...and 23 more sections

Figures (11)

Figure 1: Comparison to both DeiT and distilled DeiT (Touvron et al., 2021) on ImageNet-1K, where distilled DeiT is a DeiT model trained using a distillation token. Our proposed distillation method achieves significant improvements over DeiT-Ti, while effectively bridging the gap between the teacher and student performance for DeiT-S.
Figure 2: Illustration of our proposed feature distillation using an orthonormal projection and task-specific feature normalisation. The orthonormal projection (a) maximises the knowledge being distilled to the student backbone, while the task-specific normalisation (b) can introduce domain-specific priors to improve model performance. Icons in the original figure distinguish trainable weights from frozen weights.
Figure 3: t-SNE visualisation of features undergoing either a linear or orthogonal transformation. The orthogonal transformation preserves all of the structural feature information, whereas the linear projection can distort a lot of structure, which can diminish the efficacy of distillation.
Figure 4: Visualisation of the $V_k$D-Ti $\mathcal{L}_{distill}$ loss landscape with perturbations of the input image across two random dimensions. Normalisation significantly reduces the sensitivity of the loss to random perturbations, which leads to improved robustness and convergence for training.
Figure 5: Comparing the performance and convergence of various projector reparameterisations. Although the MLP layer initially trains fast, it begins to saturate as it starts to learn a new representation of the data.
...and 6 more figures
