Investigating Self-Supervised Methods for Label-Efficient Learning

Srinivasa Rao Nandam, Sara Atito, Zhenhua Feng, Josef Kittler, Muhammad Awais

TL;DR

This work conducts a systematic examination of different self-supervised pretext tasks, namely contrastive learning, clustering, and masked image modelling, to assess their low-shot capabilities by comparing different pretrained models, and introduces a framework that combines masked image modelling and clustering as pretext tasks.

Abstract

Vision transformers combined with self-supervised learning have enabled the development of models which scale across large datasets for several downstream tasks like classification, segmentation and detection. The low-shot learning capability of these models, across several low-shot downstream tasks, has been largely underexplored. We perform a system-level study of different self-supervised pretext tasks, namely contrastive learning, clustering, and masked image modelling, for their low-shot capabilities by comparing the pretrained models. In addition, we study the effects of collapse avoidance methods, namely centring, ME-MAX, and Sinkhorn, on these downstream tasks. Based on our detailed analysis, we introduce a framework involving both masked image modelling and clustering as pretext tasks, which performs better across all low-shot downstream tasks, including multi-class classification, multi-label classification and semantic segmentation. Furthermore, when testing the model on full-scale datasets, we show performance gains in multi-class classification, multi-label classification and semantic segmentation.
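
The abstract names three collapse-avoidance mechanisms (centring, ME-MAX, and Sinkhorn) without detailing them. Below is a minimal PyTorch sketch of how these operations are commonly implemented in the self-supervised literature (centring as in DINO, Sinkhorn-Knopp as in SwAV); the function names, temperature, and momentum defaults are illustrative assumptions, not the paper's code.

```python
import torch

def centring(teacher_logits, center, momentum=0.9):
    """Subtract a running mean of the teacher outputs (as in DINO)
    so that no single cluster dominates the assignments."""
    centred = teacher_logits - center
    # Update the running center from the current batch statistics.
    new_center = momentum * center + (1 - momentum) * teacher_logits.mean(dim=0)
    return centred, new_center

def me_max_regulariser(probs, eps=1e-6):
    """ME-MAX: encourage use of all clusters by maximising the entropy
    of the mean assignment distribution. Returned as a loss to minimise
    (negative entropy of the batch-mean probabilities)."""
    mean_probs = probs.mean(dim=0)                 # (num_clusters,)
    return (mean_probs * (mean_probs + eps).log()).sum()

def sinkhorn(logits, n_iters=3, temperature=0.05):
    """Sinkhorn-Knopp normalisation (as in SwAV): alternately normalise
    rows and columns so assignments are softly balanced across clusters."""
    Q = torch.exp(logits / temperature).t()        # (clusters, batch)
    Q /= Q.sum()
    K, B = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(dim=1, keepdim=True); Q /= K    # balance clusters
        Q /= Q.sum(dim=0, keepdim=True); Q /= B    # balance samples
    return (Q * B).t()                             # (batch, clusters), rows sum to 1
```

In a teacher-student setup such as the one studied here, any of these would typically be applied to the teacher's cluster logits before they are used as targets for the student.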

Paper Structure

This paper contains 11 sections, 5 equations, 2 figures, and 8 tables.

Figures (2)

  • Figure 1: Accuracy of different methods on the low-shot classification task for 1, 2, and 5 images per class, and for 1% of ImageNet-1K.
  • Figure 2: The MaskCluster architecture for low-shot learning generates multiple masked global views. A teacher encoder, employing transformer layers, produces embeddings $\mathbf{Z}^{e, G}_{c}$ (class tokens) and $\mathbf{Z}^{e, G}_{p}$ (patch tokens) from unmasked global views. Teacher clustering layers $CL_t$ and $PL_t$ assign clusters $\mathbf{P}^{G}_{c}$ and $\mathbf{P}^{G}_{p}$ based on these embeddings. The student encoder similarly processes masked global crops to produce embeddings $\mathbf{\bar{Z}}^{e, G}_{c}$ and $\mathbf{\bar{Z}}^{e, G}_{p}$, which are clustered by student layers $CL_s$ and $PL_s$ to generate assignments $\mathbf{\bar{P}}^{G}_{c}$ and $\mathbf{\bar{P}}^{G}_{p}$. Additionally, $\mathbf{\bar{Z}}^{e, G}_{p}$ is reconstructed to the pixel space using the student's reconstruction head. The losses are a cross-entropy between the teacher and student cluster assignments, and an $\ell_1$ loss between the reconstruction and the original view.
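
For concreteness, here is a rough PyTorch sketch of one training step as implied by the Figure 2 caption: the teacher clusters unmasked global views, the student matches those assignments from masked views, and a reconstruction head regresses pixels under an $\ell_1$ loss. Every interface below (`encode`, `cl_head`, `pl_head`, `reconstruction_head`, `lambda_rec`) is an assumption for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def training_step(teacher, student, views, masks, lambda_rec=1.0):
    """One MaskCluster-style step over multiple masked global views.
    `views` are the unmasked global crops; `masks` are the corresponding
    patch masks applied on the student side. All module names are assumed."""
    loss_cluster, loss_rec = 0.0, 0.0
    for view, mask in zip(views, masks):
        with torch.no_grad():
            # Teacher sees the unmasked global view.
            z_c, z_p = teacher.encode(view)             # class / patch tokens
            p_c = teacher.cl_head(z_c).softmax(dim=-1)  # assignments P^G_c
            p_p = teacher.pl_head(z_p).softmax(dim=-1)  # assignments P^G_p

        # Student sees the masked version of the same view.
        zs_c, zs_p = student.encode(view, mask=mask)
        logp_c = student.cl_head(zs_c).log_softmax(dim=-1)
        logp_p = student.pl_head(zs_p).log_softmax(dim=-1)

        # Cross-entropy between teacher and student cluster assignments,
        # for both class tokens and patch tokens.
        loss_cluster += -(p_c * logp_c).sum(-1).mean()
        loss_cluster += -(p_p * logp_p).sum(-1).mean()

        # l1 reconstruction of the original view from the masked patch tokens.
        recon = student.reconstruction_head(zs_p)       # back to pixel space
        loss_rec += F.l1_loss(recon, view)

    return loss_cluster + lambda_rec * loss_rec
```

In practice the teacher would be an exponential moving average of the student, and a collapse-avoidance step (centring, ME-MAX, or Sinkhorn, as sketched earlier) would be applied to the teacher assignments before computing the cross-entropy.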