Self-supervised visual learning in the low-data regime: a comparative evaluation

Sotirios Konstantakos; Jorgen Cani; Ioannis Mademlis; Despina Ioanna Chalkiadaki; Yuki M. Asano; Efstratios Gavves; Georgios Th. Papadopoulos

TL;DR

This work systematically evaluates self-supervised visual learning in the low-data regime (approximately $50k$ to $300k$ pretraining images) across four SSL categories: contrastive, generative, clustering, and self-distillation. Combining main, robustness, and domain-specific experiments on diverse datasets, the study finds a practical data-size threshold of around $50k$-$65k$ images, below which SSL gains diminish, and shows that in-domain SSL pretraining can outperform large-scale supervised pretraining on specialized tasks. Across transfer and domain-shift scenarios, DINO-based SSL often yields the strongest general embeddings among the SSL methods, although conventional supervised pretraining and DINOv2 baselines frequently still perform better. The domain-specific results highlight the value of in-domain SSL in fields such as medical and security imaging, whereas remote sensing favors large-scale supervised pretraining. Overall, this suggests that tailoring the pretraining data to the target domain is crucial for maximizing performance when labeled data is limited.

Abstract

Self-Supervised Learning (SSL) is a valuable and robust training methodology for contemporary Deep Neural Networks (DNNs), enabling unsupervised pretraining on a 'pretext task' that does not require ground-truth labels or annotations. This allows efficient representation learning from massive amounts of unlabeled training data, which in turn leads to increased accuracy in a 'downstream task' through supervised transfer learning. Although SSL is relatively straightforward to conceptualize and apply, it is not always feasible to collect or use very large pretraining datasets, especially in real-world application settings. In specialized, domain-specific scenarios in particular, it may not be practical to assemble a relevant image pretraining dataset on the order of millions of instances, or it may be computationally infeasible to pretrain at this scale, e.g., because the substantial computational resources that SSL methods typically require to produce improved visual analysis results are unavailable. This situation motivates an investigation into the effectiveness of common SSL pretext tasks when the pretraining dataset is of relatively limited size. This work briefly introduces the main families of modern visual SSL methods and subsequently conducts a thorough comparative experimental evaluation in the low-data regime, aiming to identify: a) what is learnt via low-data SSL pretraining, and b) how different SSL categories behave in such training scenarios. Interestingly, for domain-specific downstream tasks, in-domain low-data SSL pretraining outperforms the common approach of large-scale pretraining on general datasets.

Paper Structure (38 sections, 19 equations, 11 figures, 13 tables, 5 algorithms)

Introduction
Pretext Tasks
Main categories of SSL pretext tasks for images
Discussion on the behavior of the SSL pretraining categories
Comparative Evaluation Framework in the Low-Data Regime
Main experiments
Robustness experiments
Domain-specific experiments
Employed datasets
Experimental Results and Insights
Main experimental results
Robustness experimental results
Domain-specific experimental results
Explanation of learnt representations
Summary of Key Experimental Insights
...and 23 more sections

Figures (11)

Figure 1: Conceptualization of the SSL paradigm
Figure 2: Abstract indicative illustrations of the 4 main categories of SSL pretext tasks
Figure 3: Example images from the MLRSNet remote sensing dataset [qi2020mlrsnet], belonging to classes a) 'Airport' and b) 'Beach'
Figure 4: Example images from the MedPix medical imaging dataset [MedPixDataset], belonging to classes a) 'Cholesterol granuloma' and b) 'Cleidocranial Dysostosis, Dysplasia'
Figure 5: Example images from the SIXRay security imaging dataset [Miao2019SIXray], belonging to classes a) 'Gun' and b) 'Knife'
...and 6 more figures