A Closer Look at Benchmarking Self-Supervised Pre-training with Image Classification

Markus Marks; Manuel Knott; Neehar Kondapaneni; Elijah Cole; Thijs Defraeye; Fernando Perez-Cruz; Pietro Perona

TL;DR

Self-supervised learning (SSL) enables representation learning from unlabeled data, but evaluating SSL methods across downstream tasks is challenging. This study systematically benchmarks 26 SSL models on 11 datasets using multiple evaluation protocols (kNN, linear probing, and end-to-end fine-tuning, including few-shot variants) to analyze how in-domain (ID) results correlate with out-of-domain (OOD) performance. It finds that in-domain linear probing and kNN probing are strong predictors of OOD performance, and that 10% few-shot fine-tuning provides a robust proxy for OOD transfer. Embedding normalization and backbone architecture strongly influence results, while the distinction between discriminative and generative SSL largely reflects backbone choices. The results offer practical guidance for SSL benchmarking and transferability assessment, highlight efficient proxies for cross-domain generalization, and call for a theory-grounded understanding of SSL evaluation in real-world deployment.

Abstract

Self-supervised learning (SSL) is a machine learning approach where the data itself provides supervision, eliminating the need for external labels. The model is forced to learn about the data structure or context by solving a pretext task. With SSL, models can learn from abundant and cheap unlabeled data, significantly reducing the cost of training models where labels are expensive or inaccessible. In computer vision, SSL is widely used as pre-training followed by a downstream task, such as supervised transfer, few-shot learning on smaller labeled datasets, or unsupervised clustering. Unfortunately, it is infeasible to evaluate SSL methods on all possible downstream tasks and objectively measure the quality of the learned representation. Instead, SSL methods are evaluated using in-domain evaluation protocols, such as fine-tuning, linear probing, and k-nearest neighbors (kNN). However, it is not well understood how well these evaluation protocols estimate the representation quality of a pre-trained model for different downstream tasks under different conditions, such as dataset, metric, and model architecture. We study how classification-based evaluation protocols for SSL correlate with one another and how well they predict downstream performance on different dataset types. Our study includes eleven common image datasets and 26 models that were pre-trained with different SSL methods or have different model backbones. We find that in-domain linear/kNN probing protocols are, on average, the best general predictors for out-of-domain performance. We further investigate the importance of batch normalization and evaluate how robust the correlations are under different kinds of dataset domain shift. We challenge assumptions about the relationship between discriminative and generative self-supervised methods, finding that most of their performance differences can be explained by changes to model backbones.
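To make the kNN protocol named in the abstract concrete, the following is a minimal sketch of kNN probing on frozen embeddings. It is not the authors' implementation: the function name `knn_probe_accuracy`, the choice of cosine similarity, the unweighted majority vote, and the toy data are all assumptions made for illustration. In practice the embeddings would come from a pre-trained backbone rather than from random numbers.

```python
import numpy as np


def knn_probe_accuracy(train_emb, train_labels, test_emb, test_labels, k=5):
    """Top-1 accuracy of a cosine-similarity kNN classifier on frozen embeddings.

    Hypothetical helper for illustration; labels are integers in [0, n_classes).
    """
    train_emb = np.asarray(train_emb, dtype=float)
    test_emb = np.asarray(test_emb, dtype=float)
    train_labels = np.asarray(train_labels)
    test_labels = np.asarray(test_labels)

    # L2-normalize so that a dot product equals cosine similarity.
    train = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
    test = test_emb / np.linalg.norm(test_emb, axis=1, keepdims=True)

    sims = test @ train.T                      # shape (n_test, n_train)
    nn_idx = np.argsort(-sims, axis=1)[:, :k]  # k most similar training points
    nn_labels = train_labels[nn_idx]           # shape (n_test, k)

    # Unweighted majority vote over the k neighbor labels.
    n_classes = int(train_labels.max()) + 1
    votes = np.zeros((len(test), n_classes))
    rows = np.arange(len(test))[:, None]
    np.add.at(votes, (rows, nn_labels), 1)
    preds = votes.argmax(axis=1)
    return float((preds == test_labels).mean())


# Toy data standing in for embeddings (assumption: two well-separated classes).
rng = np.random.default_rng(0)
centers = np.array([[5.0, 0.0], [0.0, 5.0]])
y_train = rng.integers(0, 2, size=200)
y_test = rng.integers(0, 2, size=50)
x_train = centers[y_train] + rng.normal(scale=0.5, size=(200, 2))
x_test = centers[y_test] + rng.normal(scale=0.5, size=(50, 2))

print("kNN probe accuracy:", knn_probe_accuracy(x_train, y_train, x_test, y_test))
```

Because the backbone stays frozen and no parameters are trained, this protocol is cheap to run, which is one reason it is attractive as a proxy for more expensive evaluations.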

Paper Structure (41 sections, 11 figures, 8 tables)

Introduction
Related Work
Self-Supervised Learning
Discriminative methods.
Generative methods.
SSL Evaluation Protocols
K-nearest neighbors (kNN).
Linear probing.
End-to-end fine-tuning.
Few-shot fine-tuning.
Studies on SSL Evaluation Protocols
Experimental Setup
Models and protocols.
Correlation analysis.
OOD Datasets.
...and 26 more sections

Figures (11)

Figure 1: SSL application scenarios: We illustrate the following applications of self-supervised learning: a) supervised learning (training and fine-tuning on the same dataset), b) transfer learning (train on a large dataset and fine-tune the model on a domain dataset, which is usually smaller), c) semi-supervised learning (train on a large unlabeled dataset and fine-tune on a small labeled subset of it), d) unsupervised tasks (train on a dataset and run inference with the resulting model on any dataset to create embeddings that can be used for downstream tasks other than classification). Arrows between protocols and applications indicate a direct relationship.
Figure 2: Comparing Spearman rank correlations of top-1 classification accuracies obtained by different evaluation protocols (kNN: k-nearest neighbors, LP: linear probing, FT: fine-tuning, FT-10%: 10%-fine-tuning, FT-1%: 1%-fine-tuning). In-domain (ID) refers to ImageNet-1k, which was also used for pre-training. Out-of-domain (OOD) metrics are averaged over eleven datasets as described in the Experimental Setup section. In-domain metrics generally correlate highly (left panel), with fine-tuning having the weakest average correlation coefficient. When comparing ID with OOD protocols (right panel), correlation coefficients are visibly lower, indicating a domain-shift effect that impacts both the absolute accuracy and the protocols' rank ordering (correlation). A more detailed version of these matrices, showing additional protocol variations (with and without feature normalization), is given in the extended correlation-matrix figure.
Figure 3: Spearman rank correlations of top-1 classification accuracies derived from in-domain and out-of-domain protocols under certain types of domain shift. We differentiate between fine-grained and coarse-grained categorical domain shifts (left half of each panel). Further, we compare categorical with stylistic domain shifts (right half of each panel). Black rectangles highlight cells where the same evaluation protocol is used for ID and OOD.
Figure 4: Fine-tuning accuracies with and without batch normalization for two example models that appear to have scaled (left, DINO+ResNet-50) and unscaled (right, MaskFeat+ViT-B/16) embedding representations. The x-axes display all datasets included in this study and the number of optimizer steps derived from the dataset size, batch size, and total number of epochs. For MaskFeat, batch normalization has a significant effect when the number of optimizer steps is small and only a small effect when the number of steps is large, implying less-scaled features compared to DINO.
Figure 5: Scatter plot of linear-probing versus fine-tuning accuracies for ImageNet (in-domain). Each dot represents a model. Color encodes the model family: blue for discriminative and orange for generative models. Shapes indicate which backbones were used. The dotted line represents the equal error line; the solid line is a linear regression with a 90% confidence interval.
...and 6 more figures
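
The rank-correlation analysis described in the captions of Figures 2 and 3 can be sketched as follows. This is an illustrative sketch only, not the authors' code: the helper names `average_ranks` and `spearman` are hypothetical, and the accuracy values are made-up placeholders, not results from the paper. Spearman's coefficient is computed here as the Pearson correlation of the ranks, with tied values receiving their average rank.

```python
import numpy as np


def average_ranks(x):
    """Ranks starting at 1; tied values share the mean of their ranks."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="stable")
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    for value in np.unique(x):
        mask = x == value
        ranks[mask] = ranks[mask].mean()
    return ranks


def spearman(a, b):
    """Spearman rank correlation as the Pearson correlation of the ranks."""
    return float(np.corrcoef(average_ranks(a), average_ranks(b))[0, 1])


# Placeholder accuracies for 5 hypothetical models (NOT values from the paper).
id_linear_probe = np.array([0.71, 0.68, 0.75, 0.60, 0.66])

# Placeholder OOD accuracies: rows are models, columns are OOD datasets.
ood_linear_probe = np.array([
    [0.55, 0.62, 0.48],
    [0.50, 0.60, 0.47],
    [0.58, 0.66, 0.52],
    [0.41, 0.49, 0.40],
    [0.52, 0.57, 0.44],
])

# Average each model's OOD accuracy over datasets, then compare model rankings.
ood_mean = ood_linear_probe.mean(axis=1)
print("Spearman(ID LP, mean OOD LP):", spearman(id_linear_probe, ood_mean))
```

A coefficient near 1 would mean that the in-domain protocol ranks the models in almost the same order as the averaged out-of-domain protocol, which is the sense in which an in-domain protocol "predicts" out-of-domain performance.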
