All You Need in Knowledge Distillation Is a Tailored Coordinate System

Junjie Zhou, Ke Zhu, Jianxin Wu

TL;DR

This paper tackles the inefficiency and rigidity of traditional knowledge distillation by proposing the Tailored Coordinate System (TCS), which uses an SSL-pretrained teacher and captures its dark knowledge in a PCA-based coordinate system. The method computes the coordinate system from a single forward pass of the teacher and tailors it to the target task via iterative feature selection, optionally enhanced with an efficient eLSH-based mimicking loss. Empirically, TCS achieves higher accuracy than state-of-the-art KD while halving training time and reducing GPU memory by about 25%, and it works across diverse architectures and in practical few-shot learning (pFSL) scenarios with cross-architecture transfer. The approach offers a flexible, teacher-free KD paradigm with wide applicability and potential extensions to detection, segmentation, and large language/multimodal models.

Abstract

Knowledge Distillation (KD) is essential in transferring dark knowledge from a large teacher to a small student network, such that the student can be much more efficient than the teacher but with comparable accuracy. Existing KD methods, however, rely on a large teacher trained specifically for the target task, which is both inflexible and inefficient. In this paper, we argue that an SSL-pretrained model can effectively act as the teacher and that its dark knowledge can be captured by the coordinate system, or linear subspace, in which the features lie. We need only one forward pass of the teacher, after which we tailor the coordinate system (TCS) for the student network. Our TCS method is teacher-free and applies to diverse architectures, works well for KD and practical few-shot learning, and allows cross-architecture distillation with a large capacity gap. Experiments show that TCS achieves significantly higher accuracy than state-of-the-art KD methods, while only requiring roughly half of their training time and GPU memory costs.
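To make the idea of a feature coordinate system concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the teacher's features from one forward pass are already stored in an array, fits PCA to them, and expresses features in the resulting axes. The function names (`fit_coordinate_system`, `to_coordinates`, `alignment_loss`), the random stand-in features, and the mean-squared alignment loss are all illustrative assumptions; the paper's feature selection step and eLSH loss are not modeled here.

```python
import numpy as np


def fit_coordinate_system(teacher_feats, k):
    """Fit PCA to teacher features of shape (n, d).

    Returns the feature mean (d,), the top-k principal axes (d, k),
    and the variance captured along each axis (k,).
    """
    mean = teacher_feats.mean(axis=0)
    centered = teacher_feats - mean
    # Rows of vt are principal directions, sorted by decreasing singular value.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    variances = (s[:k] ** 2) / (teacher_feats.shape[0] - 1)
    return mean, vt[:k].T, variances


def to_coordinates(feats, mean, axes):
    """Express features of shape (n, d) in the PCA coordinate system, giving (n, k)."""
    return (feats - mean) @ axes


def alignment_loss(student_coords, target_coords):
    """Hypothetical alignment objective: mean squared error between coordinates."""
    return float(np.mean((student_coords - target_coords) ** 2))


# Stand-in for teacher features collected in a single forward pass (assumed data).
rng = np.random.default_rng(0)
teacher_feats = rng.normal(size=(512, 64))

mean, axes, variances = fit_coordinate_system(teacher_feats, k=8)
targets = to_coordinates(teacher_feats, mean, axes)  # fixed targets, shape (512, 8)
```

Because the targets are computed once and then kept fixed, the teacher would not need to run again during student training in this sketch, which is consistent with the single forward pass described above.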

Paper Structure

This paper contains 20 sections, 10 equations, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Top-1 accuracy, training time and GPU memory consumption on ImageNet-1K. Our TCS achieves the fastest training, smallest GPU memory footprint and highest accuracy among KD methods. In the left figure, the x-axis is logarithmic. The teacher is Swin-Tiny and the student is ResNet18.
  • Figure 2: Illustration of existing KD methods and our TCS method. (a) Logits-based distillation, where the student learns only from the final predictions of the teacher ($\mathbf{p}^t$); (b) feature-based distillation, in which the student also mimics intermediate features of the teacher ($\mathbf{f}^t$); and (c) our TCS method, which aligns student features into the teacher's coordinate system and tailors it to the target task by feature selection. TCS is optionally augmented by an eLSH module, an efficient unsupervised feature mimicking method. This figure is best viewed in color.