SEED: Self-supervised Distillation For Visual Representation

Zhiyuan Fang, Jianfeng Wang, Lijuan Wang, Lei Zhang, Yezhou Yang, Zicheng Liu

TL;DR

SEED introduces a self-supervised distillation framework that transfers knowledge from a large, SSL-pretrained teacher to a smaller student without using labels. By matching the teacher’s instance-similarity distribution over a dynamically updated queue, SEED enables small architectures to achieve substantially higher ImageNet performance and better transferability than traditional contrastive SSL. The approach is robust across teacher choices and distillation variants, and it improves linear, semi-supervised, and downstream task performance including detection and segmentation. This work highlights a practical path to high-quality visual representations for resource-constrained models, with broad implications for deploying SSL in real-world, small-footprint settings.

Abstract

This paper is concerned with self-supervised learning for small models. The problem is motivated by our empirical studies that while the widely used contrastive self-supervised learning method has shown great progress on large model training, it does not work well for small models. To address this problem, we propose a new learning paradigm, named SElf-SupErvised Distillation (SEED), where we leverage a larger network (as Teacher) to transfer its representational knowledge into a smaller architecture (as Student) in a self-supervised fashion. Instead of directly learning from unlabeled data, we train a student encoder to mimic the similarity score distribution inferred by a teacher over a set of instances. We show that SEED dramatically boosts the performance of small networks on downstream tasks. Compared with self-supervised baselines, SEED improves the top-1 accuracy from 42.2% to 67.6% on EfficientNet-B0 and from 36.3% to 68.2% on MobileNet-v3-Large on the ImageNet-1k dataset.
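
The distillation objective described in the abstract can be illustrated with a small sketch. This is not the authors' implementation. It is a plain-Python illustration, assuming that embeddings are L2-normalized, that similarity is a dot product against a queue of teacher embeddings, and that both distributions are temperature-scaled softmaxes. All function names, the temperature defaults, and the first-in-first-out queue policy are hypothetical choices made for this example. A real training setup would use a deep learning framework and backpropagate through the student only.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit length so dot products act as cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def softmax(scores, temperature):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [s / temperature for s in scores]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def similarity_distribution(embedding, queue, temperature):
    """Distribution over queued instances, based on similarity to `embedding`."""
    scores = [sum(a * b for a, b in zip(embedding, q)) for q in queue]
    return softmax(scores, temperature)


def distillation_loss(student_emb, teacher_emb, queue,
                      t_student=0.2, t_teacher=0.01):
    """Cross entropy between teacher and student similarity distributions.

    The temperature defaults are illustrative assumptions, not values
    taken from the text above.
    """
    p_teacher = similarity_distribution(teacher_emb, queue, t_teacher)
    p_student = similarity_distribution(student_emb, queue, t_student)
    return -sum(pt * math.log(ps + 1e-12)
                for pt, ps in zip(p_teacher, p_student))


def update_queue(queue, teacher_emb, max_size):
    """Assumed FIFO policy: enqueue the newest teacher embedding, drop the oldest."""
    queue.append(teacher_emb)
    if len(queue) > max_size:
        queue.pop(0)


if __name__ == "__main__":
    random.seed(0)
    dim, size = 8, 16
    queue = [l2_normalize([random.gauss(0, 1) for _ in range(dim)])
             for _ in range(size)]
    teacher = l2_normalize([random.gauss(0, 1) for _ in range(dim)])
    student = l2_normalize([random.gauss(0, 1) for _ in range(dim)])
    print("loss:", distillation_loss(student, teacher, queue))
    update_queue(queue, teacher, size)
```

With equal temperatures, the loss reaches its minimum when the student's similarity distribution equals the teacher's, which is the sense in which the student "mimics" the teacher without any labels.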

Paper Structure

This paper contains 21 sections, 16 equations, 7 figures, 14 tables.

Figures (7)

  • Figure 1: SEED vs. MoCo-V2 (chen2020improved) on ImageNet-1K linear probe accuracy. The vertical axis is the top-1 accuracy and the horizontal axis is the number of learnable parameters for different network architectures. Directly applying self-supervised contrastive learning (MoCo-V2) does not work well for smaller architectures, while our method (SEED) leads to a dramatic performance boost. Details of the setting can be found in Section \ref{sec:exp}.
  • Figure 2: Illustration of our self-supervised distillation pipeline. The teacher encoder is pre-trained by SSL and kept frozen during the distillation. The student encoder is trained by minimizing the cross entropy between the teacher's and the student's probabilities for an augmented view of an image, computed over a dynamically maintained queue.
  • Figure 3: ImageNet-1k Top-1 accuracy for semi-supervised evaluations using 1% (red line) and 10% (blue line) of the annotations for linear fine-tuning, in comparison with the fully supervised (green line) linear evaluation baseline for SEED. For the points where the Teacher's number of parameters is 0, we show the semi-supervised linear evaluation results of MoCo-V2 without any distillation. The Student models tend to perform better on the semi-supervised tasks after distillation from larger Teachers.
  • Figure 4: ImageNet-1k Accuracy (%) of student networks (EfficientNet-B0 and ResNet-18) transferred to other domains (CIFAR-10, CIFAR-100, SUN-397 datasets) with and without distillation from larger architectures (ResNet-50/101/152).
  • Figure 5: Accuracy (%) of student networks (EfficientNet-B0 and ResNet-18) on ImageNet distilled from wider MoCo-V2 pre-trained ResNet (ResNet-50/101/152$\times$2).
  • ...and 2 more figures