Retro: Reusing teacher projection head for efficient embedding distillation on Lightweight Models via Self-supervised Learning
Khanh-Binh Nguyen, Chae Jung Park
TL;DR
This work addresses efficient embedding distillation for lightweight models in self-supervised learning by reusing the teacher's projection head (Retro). A dimension adapter maps student features into the input space of the teacher's projection head, which stays fixed, and together with mean-student symmetry this aligns student embeddings with the teacher's. Retro achieves state-of-the-art ImageNet linear-evaluation results across multiple lightweight architectures and transfers robustly to detection and segmentation. The approach reduces the number of trainable parameters, avoids guesswork in projection-head design, and adds no inference overhead. Overall, Retro highlights the projection head as a critical source of transferable knowledge in SSL distillation.
Abstract
Self-supervised learning (SSL) is gaining attention for its ability to learn effective representations from large amounts of unlabeled data. Lightweight models can be distilled from larger self-supervised pre-trained models using contrastive and consistency constraints. However, because the student's and teacher's projection heads differ in size, it is difficult for the student to mimic the teacher's embedding accurately. We propose \textsc{Retro}, which reuses the teacher's projection head for the student. Our experimental results demonstrate significant improvements over the state of the art on all lightweight models. For instance, when training EfficientNet-B0 with ResNet-50/101/152 as teachers, our approach raises ImageNet linear-evaluation accuracy to $66.9\%$, $69.3\%$, and $69.8\%$, respectively, with significantly fewer parameters.
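The head-reuse idea can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the toy dimensions, the bias-free two-layer head, the function names, and the cosine alignment loss are all assumptions made for illustration, and the mean-student component is omitted. The point it shows is the data flow, in which only the adapter would be trained alongside the student backbone while the teacher's projection head is shared and left frozen.

```python
import math
import random


def linear(x, weight):
    """Bias-free linear map; weight has shape (out_dim, in_dim)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def relu(x):
    return [max(0.0, v) for v in x]


def l2_normalize(x):
    norm = math.sqrt(sum(v * v for v in x)) or 1.0  # guard against zero vectors
    return [v / norm for v in x]


def random_matrix(rng, out_dim, in_dim):
    scale = 1.0 / math.sqrt(in_dim)
    return [[rng.uniform(-scale, scale) for _ in range(in_dim)]
            for _ in range(out_dim)]


def teacher_head(features, head):
    """Hypothetical two-layer MLP projection head; its weights stay frozen."""
    w1, w2 = head
    return linear(relu(linear(features, w1)), w2)


def alignment_loss(student_emb, teacher_emb):
    """Assumed stand-in objective: 1 - cosine similarity of the embeddings."""
    s, t = l2_normalize(student_emb), l2_normalize(teacher_emb)
    return 1.0 - sum(a * b for a, b in zip(s, t))


rng = random.Random(0)
# Toy sizes; real backbones have far wider feature vectors.
STUDENT_DIM, TEACHER_DIM, HIDDEN_DIM, EMBED_DIM = 6, 8, 8, 4

# Trainable: maps student features to the teacher head's input width.
adapter = random_matrix(rng, TEACHER_DIM, STUDENT_DIM)
# Frozen: taken from the pre-trained teacher and reused for the student.
head = (random_matrix(rng, HIDDEN_DIM, TEACHER_DIM),
        random_matrix(rng, EMBED_DIM, HIDDEN_DIM))

# Random vectors stand in for backbone outputs on the same image.
student_features = [rng.gauss(0.0, 1.0) for _ in range(STUDENT_DIM)]
teacher_features = [rng.gauss(0.0, 1.0) for _ in range(TEACHER_DIM)]

student_emb = teacher_head(linear(student_features, adapter), head)
teacher_emb = teacher_head(teacher_features, head)
loss = alignment_loss(student_emb, teacher_emb)
print(f"alignment loss: {loss:.4f}")
```

Because the adapter and projection head are used only during distillation, they can be discarded afterwards, which is consistent with the claim that the method adds no inference overhead.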
