Simple Unsupervised Knowledge Distillation With Space Similarity

Aditya Singh; Haohan Wang

TL;DR

This work tackles the challenge of transferring knowledge from a self-supervised teacher to a smaller student without labels. It introduces CoSS, a two-part objective that combines feature-level cosine alignment with a novel space similarity term so that the student's embedding manifold aligns with the teacher's, preserving the spatial structure that $L_2$ normalization would otherwise discard. The method uses an offline k-nearest neighbor pre-processing step to capture local manifold structure and a simple online distillation loss $\mathcal{L}_{CoSS} = \mathcal{L}_{co} + \lambda \mathcal{L}_{ss}$, achieving state-of-the-art or competitive results across ImageNet classification, transfer learning, dense prediction, retrieval, and robustness benchmarks. The results demonstrate the practicality of manifold-aware UKD for compact models without requiring feature queues or heavy augmentations, with potential applicability to other domains.
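To make the two-part objective concrete, the following is a minimal sketch rather than the authors' implementation. This summary does not spell out the two terms, so the sketch assumes that $\mathcal{L}_{co}$ is the negative mean cosine similarity between corresponding student and teacher embeddings (rows of the batch matrix), and that $\mathcal{L}_{ss}$ is the same quantity computed between corresponding feature dimensions (columns). The function names, the default `lam=1.0`, and the assumption that the student has already been projected to the teacher's dimensionality are all hypothetical.

```python
import math


def cosine(u, v, eps=1e-12):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def coss_loss(student, teacher, lam=1.0):
    """Sketch of L_CoSS = L_co + lam * L_ss for one batch.

    student, teacher: lists of b embeddings, each a list of d floats.
    Assumption: both use the same dimensionality d.
    """
    b, d = len(teacher), len(teacher[0])

    # Feature similarity (assumed form): compare sample i of the student
    # with sample i of the teacher, i.e. row against row.
    l_co = -sum(cosine(s, t) for s, t in zip(student, teacher)) / b

    # Space similarity (assumed form): compare dimension j of the student
    # with dimension j of the teacher, i.e. column against column.
    student_cols = list(zip(*student))
    teacher_cols = list(zip(*teacher))
    l_ss = -sum(cosine(s, t) for s, t in zip(student_cols, teacher_cols)) / d

    return l_co + lam * l_ss


teacher_batch = [[1.0, 2.0], [3.0, 4.0]]
# Same directions per sample as the teacher, but the first sample is rescaled.
rescaled_student = [[2.0, 4.0], [3.0, 4.0]]

perfect = coss_loss(teacher_batch, teacher_batch)
rescaled = coss_loss(rescaled_student, teacher_batch)
```

Under these assumptions, a student that reproduces the teacher exactly reaches the minimum of both terms, while a student that only matches each sample's direction is still penalised by the column-wise term.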

Abstract

Recent studies show that self-supervised learning (SSL) does not readily extend to smaller architectures. One way to mitigate this shortcoming while still training a smaller network without labels is unsupervised knowledge distillation (UKD). Existing UKD approaches handcraft preservation-worthy inter/intra-sample relationships between the teacher and its student. However, this may overlook other key relationships present in the teacher's mapping. In this paper, instead of heuristically constructing preservation-worthy relationships between samples, we directly motivate the student to model the teacher's embedding manifold. If the mapped manifold is similar, all inter/intra-sample relationships are indirectly conserved. We first demonstrate that prior methods cannot preserve the teacher's latent manifold because they rely solely on $L_2$ normalised embedding features. Subsequently, we propose a simple objective to capture the information lost due to normalisation. Our proposed loss component, termed space similarity, motivates each dimension of a student's feature space to be similar to the corresponding dimension of its teacher. We perform extensive experiments demonstrating strong performance of our proposed approach on various benchmarks.
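One way to see why relying solely on $L_2$ normalised features can lose manifold information is the toy example below. It is an illustration under stated assumptions, not an experiment from the paper: the three 2-D "teacher" embeddings and the rescaled "student" embeddings are made-up values, and the helper names are hypothetical.

```python
import math


def l2_normalise(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


# Made-up teacher embeddings for three samples.
teacher = [[1.0, 0.0], [0.0, 1.0], [3.0, 3.0]]
# Hypothetical student: each teacher embedding rescaled by a different
# positive factor, so every sample keeps its direction but not its norm.
student = [[5.0, 0.0], [0.0, 0.5], [1.0, 1.0]]

# After L2 normalisation the student and teacher embeddings coincide,
# so an objective defined only on normalised features sees a perfect match.
same_directions = all(
    math.dist(l2_normalise(s), l2_normalise(t)) < 1e-9
    for s, t in zip(student, teacher)
)

# In the unnormalised space, however, the geometry differs: the distance
# between the first two samples is not the same for teacher and student.
teacher_d01 = math.dist(teacher[0], teacher[1])
student_d01 = math.dist(student[0], student[1])
```

Here the normalised embeddings are identical while the pairwise distances in the original space are not, which is the kind of information a per-dimension comparison can help recover.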

Paper Structure (26 sections, 6 equations, 3 figures, 14 tables, 1 algorithm)

This paper contains 26 sections, 6 equations, 3 figures, 14 tables, 1 algorithm.

Introduction
Related Work
Logit Based Distillation
Feature Based Distillation
Key Differences
Motivation
Method
Offline Pre-processing
Training Objectives
Experiments
Settings
Supervised Classification
Transfer Learning
Dense Predictions
Image Retrieval
...and 11 more sections

Figures (3)

Figure 1: The proposed CoSS (feature similarity + space similarity) distillation framework. The graphic shows the similarity of one pair of corresponding feature dimensions being maximised. We perform this maximisation for every corresponding pair of teacher and student dimensions.
Figure 2: The training batch is composed of random samples (blue $\star$) and their $k$ nearest samples (orange $\bullet$).
Figure 3: Plots comparing the latent space of the teacher and different students. Visually, although SEED separates the input samples adequately, the learnt mapping is not faithful to the teacher. In contrast, adding the space similarity objective to the standard cosine similarity allows the student to learn a mapping that aligns better with its teacher.
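The batch composition described in the Figure 2 caption (random samples plus their $k$ nearest samples, looked up from an offline pre-processing step) can be sketched as follows. This is a sketch under assumptions: the summary does not state the distance metric or the batch sizes, so Euclidean distance, the brute-force search, and the names `knn_table` and `sample_batch` are all hypothetical choices made for illustration.

```python
import math
import random


def knn_table(embeddings, k):
    """Offline step: for every sample, store the indices of its k nearest
    neighbours. Euclidean distance is an assumption made for this sketch."""
    table = []
    for i, anchor in enumerate(embeddings):
        others = [j for j in range(len(embeddings)) if j != i]
        others.sort(key=lambda j: math.dist(anchor, embeddings[j]))
        table.append(others[:k])
    return table


def sample_batch(table, num_anchors, rng):
    """Online step: draw random anchor samples and place each anchor's
    stored neighbours right after it in the batch."""
    anchors = rng.sample(range(len(table)), num_anchors)
    batch = []
    for a in anchors:
        batch.append(a)
        batch.extend(table[a])
    return batch


# Toy 1-D "teacher embeddings" forming two tight pairs and one outlier.
toy_embeddings = [[0.0], [0.1], [5.0], [5.1], [10.0]]
table = knn_table(toy_embeddings, k=1)
batch = sample_batch(table, num_anchors=2, rng=random.Random(0))
```

Each batch then holds `num_anchors * (k + 1)` indices. A neighbour of one anchor may coincide with another anchor, so a real data loader might want to deduplicate.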

Theorems & Definitions (1)

Definition 1
