Retro: Reusing teacher projection head for efficient embedding distillation on Lightweight Models via Self-supervised Learning

Khanh-Binh Nguyen, Chae Jung Park

TL;DR

This work tackles efficient embedding distillation for lightweight models in self-supervised learning by reusing the teacher's projection head (Retro). By introducing a dimension adapter and mean-student symmetry, Retro aligns student embeddings with a fixed teacher projection, achieving state-of-the-art linear ImageNet results across multiple light architectures and showing robust transfer to detection and segmentation. The approach reduces trainable parameters and avoids projection-head design guesswork, while incurring no inference overhead. Overall, Retro highlights the projection head as a critical source of transferable knowledge in SSL distillation.

Abstract

Self-supervised learning (SSL) is gaining attention for its ability to learn effective representations with large amounts of unlabeled data. Lightweight models can be distilled from larger self-supervised pre-trained models using contrastive and consistency constraints. Still, the different sizes of the projection heads make it challenging for students to mimic the teacher's embedding accurately. We propose Retro, which reuses the teacher's projection head for students, and our experimental results demonstrate significant improvements over the state-of-the-art on all lightweight models. For instance, when training EfficientNet-B0 using ResNet-50/101/152 as teachers, our approach improves the linear result on ImageNet to $66.9\%$, $69.3\%$, and $69.8\%$, respectively, with significantly fewer parameters.
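
The head-reuse idea described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the layer sizes, the single-linear-layer adapter, the two-layer MLP head, and all names (`teacher_head`, `adapter`, `student_embedding`, `count_params`) are assumptions made for illustration, and random weights stand in for pre-trained ones.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative (assumed) widths: student features, teacher features,
# hidden layer of the head, and final embedding.
STUDENT_DIM, TEACHER_DIM, HIDDEN_DIM, EMBED_DIM = 1280, 2048, 2048, 128

# Frozen teacher projection head, assumed here to be a two-layer MLP.
# Random values stand in for the pre-trained weights; they are never updated.
teacher_head = {
    "w1": rng.standard_normal((TEACHER_DIM, HIDDEN_DIM)) * 0.01,
    "w2": rng.standard_normal((HIDDEN_DIM, EMBED_DIM)) * 0.01,
}

# Trainable dimension adapter, assumed here to be one linear map that lifts
# student features to the width the teacher head expects.
adapter = {"w": rng.standard_normal((STUDENT_DIM, TEACHER_DIM)) * 0.01}


def project_with_teacher_head(features, head):
    """Apply the (frozen) head: linear -> ReLU -> linear."""
    hidden = np.maximum(features @ head["w1"], 0.0)
    return hidden @ head["w2"]


def student_embedding(student_features):
    """Student path: encoder features -> adapter -> reused teacher head."""
    adapted = student_features @ adapter["w"]
    return project_with_teacher_head(adapted, teacher_head)


def count_params(params):
    """Total number of scalar weights in a parameter dictionary."""
    return sum(w.size for w in params.values())


# A batch of 4 hypothetical student encoder outputs.
batch = rng.standard_normal((4, STUDENT_DIM))
z_student = student_embedding(batch)
```

In this sketch the student never owns a projection head: besides its encoder, only `adapter` would receive gradient updates, and since the adapter and head are used only during distillation, the deployed student encoder is unchanged.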

Paper Structure

This paper contains 23 sections, 5 equations, 5 figures, and 7 tables.

Figures (5)

  • Figure 1: ImageNet top-1 linear evaluation accuracy on different network architectures. Our method significantly exceeds the result of training with MoCo-V2 directly and surpasses the state-of-the-art DisCo by a large margin. In particular, the result of EfficientNet-B0 is quite close to that of the teacher ResNet-50, although EfficientNet-B0 has only 16.3% as many parameters as ResNet-50. The improvement attributed to Retro is measured relative to the MoCo-V2 baseline.
  • Figure 2: Comparison with existing self-supervised distillers. $x$ is the input image. The orange arrow indicates the knowledge transfer direction. Both CompRess [abbasi2020compress] and SEED [fang2021seed] transfer the knowledge of the similarity between a sample and a negative memory bank. DisCo [gao2022disco] constrains the last embedding of the student to be consistent with that of the teacher. Our Retro improves on DisCo by reusing the teacher projection head for the student, which is better able to generate generalized embeddings. 'Adt.' indicates the adapter layer.
  • Figure 3: The pipeline of the proposed Retro technique. Two different data augmentations first transform a single image into two views. A self-supervised pre-trained teacher is added alongside the original contrastive SSL component, and the final embeddings generated by the learnable student and the frozen teacher must be consistent for each view.
  • Figure 4: Top-1 accuracy on the CIFAR-10 and CIFAR-100 datasets, with separate ResNet-18 and EfficientNet-B0 panels for each dataset.
  • Figure 5: Adapter structure for different student networks.
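
The per-view consistency requirement in the Figure 3 caption can be sketched as follows. The caption only says that student and teacher embeddings "must be consistent for each view"; the specific choice below (squared distance between L2-normalized embeddings, summed over the two views) is an assumption, as are the function names, and the contrastive SSL term is omitted.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length (eps guards against division by zero)."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)


def consistency_loss(z_student, z_teacher):
    """Batch mean of the squared distance between normalized embeddings."""
    diff = l2_normalize(z_student) - l2_normalize(z_teacher)
    return float(np.mean(np.sum(diff ** 2, axis=1)))


def distillation_loss(student_views, teacher_views):
    """Sum the consistency term over the augmented views of the same images."""
    return sum(
        consistency_loss(z_s, z_t)
        for z_s, z_t in zip(student_views, teacher_views)
    )


# Toy example: two views, batch of 3 images, 128-dimensional embeddings.
rng = np.random.default_rng(0)
teacher_views = [rng.standard_normal((3, 128)) for _ in range(2)]
student_views = [z + 0.1 * rng.standard_normal(z.shape) for z in teacher_views]
loss = distillation_loss(student_views, teacher_views)
```

Under this assumed loss, the teacher embeddings are treated as fixed targets, so in a real training loop gradients would flow only through the student path.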