Retro: Reusing teacher projection head for efficient embedding distillation on Lightweight Models via Self-supervised Learning

Khanh-Binh Nguyen; Chae Jung Park

TL;DR

This work tackles efficient embedding distillation for lightweight models in self-supervised learning by reusing the teacher's projection head (Retro). By introducing a dimension adapter and mean-student symmetry, Retro aligns student embeddings with a fixed teacher projection, achieving state-of-the-art ImageNet linear evaluation results across multiple lightweight architectures and transferring robustly to detection and segmentation. The approach reduces the number of trainable parameters and removes the guesswork of designing a student projection head, while adding no inference overhead. Overall, Retro highlights the projection head as a critical source of transferable knowledge in SSL distillation.

Abstract

Self-supervised learning (SSL) is gaining attention for its ability to learn effective representations from large amounts of unlabeled data. Lightweight models can be distilled from larger self-supervised pre-trained models using contrastive and consistency constraints. However, the differing sizes of the teacher and student projection heads make it difficult for the student to mimic the teacher's embedding accurately. We propose Retro, which reuses the teacher's projection head for the student. Our experimental results demonstrate significant improvements over the state-of-the-art on all lightweight models. For instance, when training EfficientNet-B0 with ResNet-50/101/152 as teachers, our approach improves the linear evaluation accuracy on ImageNet to $66.9\%$, $69.3\%$, and $69.8\%$, respectively, with significantly fewer parameters.
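The idea of reusing the teacher's projection head can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The dimensions are assumptions not stated above (1280-d EfficientNet-B0 features, 2048-d ResNet-50 features, and a MoCo-V2 style two-layer MLP head producing 128-d embeddings), the adapter is assumed to be a single linear layer, and all names (`adapter`, `teacher_head`, `student_embed`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions (hypothetical, chosen for illustration only).
STUDENT_DIM, TEACHER_DIM, EMBED_DIM = 1280, 2048, 128


def init_linear(fan_in, fan_out):
    """Create a randomly initialized linear layer as a dict of arrays."""
    weight = rng.normal(0.0, fan_in ** -0.5, size=(fan_in, fan_out))
    return {"W": weight, "b": np.zeros(fan_out)}


def linear(x, layer):
    return x @ layer["W"] + layer["b"]


# Teacher projection head: in a real setup these weights would be copied
# from the pre-trained teacher and kept frozen (never updated).
teacher_head = [
    init_linear(TEACHER_DIM, TEACHER_DIM),
    init_linear(TEACHER_DIM, EMBED_DIM),
]

# Adapter: the only new trainable module on the student side.
# Assumption: one linear layer mapping student features to the teacher width.
adapter = init_linear(STUDENT_DIM, TEACHER_DIM)


def project_with_teacher_head(x):
    """Linear -> ReLU -> Linear, using the frozen teacher weights."""
    hidden = np.maximum(linear(x, teacher_head[0]), 0.0)
    return linear(hidden, teacher_head[1])


def student_embed(student_features):
    """Student features -> adapter -> reused teacher projection head."""
    return project_with_teacher_head(linear(student_features, adapter))


def count_params(layers):
    return sum(layer["W"].size + layer["b"].size for layer in layers)


features = rng.normal(size=(4, STUDENT_DIM))  # stand-in for backbone output
z = student_embed(features)                   # embeddings in the teacher's space
```

In this sketch only `adapter` would receive gradient updates, and because the projection head is used only during pre-training, neither the adapter nor the head needs to be kept for downstream inference.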

Paper Structure (23 sections, 5 equations, 5 figures, 7 tables)

Introduction
Related Work
Self-supervised Learning
Knowledge Distillation
Method
Self-supervised Learning and Knowledge Distillation
Preliminary on Contrastive Learning Based SSL
Contrastive Learning Based SSL
DisCo
Retro
Loss function and parameter update process
Experiments
Implementation Details
Linear Evaluation
Semi-supervised Linear Evaluation
...and 8 more sections

Figures (5)

Figure 1: ImageNet top-1 linear evaluation accuracy on different network architectures. Our method significantly exceeds the result of training with MoCo-V2 directly and surpasses the state-of-the-art DisCo by a large margin. In particular, the accuracy of EfficientNet-B0 is close to that of the teacher ResNet-50, even though EfficientNet-B0 has only 16.3% of ResNet-50's parameters. The improvement attributed to Retro is measured relative to the MoCo-V2 baseline.
Figure 2: Comparison with existing self-supervised distillers. $x$ is the input image. The orange arrow indicates the direction of knowledge transfer. Both CompRess [abbasi2020compress] and SEED [fang2021seed] transfer knowledge about the similarity between a sample and a negative memory bank. DisCo [gao2022disco] constrains the final embedding of the student to be consistent with that of the teacher. Our Retro improves on DisCo by reusing the teacher's projection head for the student, which is better able to produce generalizable embeddings. 'Adt.' indicates the adapter layer.
Figure 3: The pipeline of the proposed Retro technique. Two different data augmentations first transform a single image into two views. A self-supervised pre-trained teacher is added alongside the original contrastive SSL component, and the final embeddings generated by the learnable student and the frozen teacher must be consistent for each view.
Figure 4: Top-1 accuracy on the CIFAR-10 (ResNet-18 and EfficientNet-B0 panels) and CIFAR-100 (ResNet-18 and EfficientNet-B0 panels) datasets.
Figure 5: Adapter structure for different student networks.
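
The per-view consistency constraint described in the Figure 3 caption can be sketched as follows. The caption only says that student and teacher embeddings "must be consistent"; the specific distance used here (squared Euclidean distance between L2-normalized embeddings, summed over the two views) is an assumption, and the function names are hypothetical.

```python
import numpy as np


def l2_normalize(z, eps=1e-12):
    """Scale each row of z to unit Euclidean norm."""
    norms = np.linalg.norm(z, axis=1, keepdims=True)
    return z / np.maximum(norms, eps)


def consistency_loss(z_student, z_teacher):
    """Mean squared distance between normalized student/teacher embeddings.

    Assumption: the teacher embedding is a fixed target (frozen teacher),
    so in training only the student side would receive gradients.
    """
    diff = l2_normalize(z_student) - l2_normalize(z_teacher)
    return float(np.mean(np.sum(diff ** 2, axis=1)))


def two_view_consistency(student_views, teacher_views):
    """Sum the consistency loss over the augmented views of a batch."""
    return sum(
        consistency_loss(z_s, z_t)
        for z_s, z_t in zip(student_views, teacher_views)
    )


# Toy usage: two views, batch of 3, 128-d embeddings (sizes are illustrative).
rng = np.random.default_rng(0)
teacher_views = [rng.normal(size=(3, 128)) for _ in range(2)]
student_views = [t + 0.1 * rng.normal(size=t.shape) for t in teacher_views]
loss = two_view_consistency(student_views, teacher_views)
```

With normalized embeddings this distance depends only on the angle between the two vectors, so it is zero when the student matches the teacher's direction exactly and largest when the two point in opposite directions.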
