Scaling and Benchmarking Self-Supervised Visual Representation Learning

Priya Goyal, Dhruv Mahajan, Abhinav Gupta, Ishan Misra

TL;DR

The paper scales self-supervised visual representation learning to 100M images along three axes (data size, model capacity, and task hardness) using two pretext tasks, Jigsaw and Colorization. It shows that larger datasets and higher-capacity models yield meaningful transfer gains, and that harder pretext tasks bring additional improvements, especially for deeper networks. A 9-task benchmark is proposed to assess representation quality across diverse domains; on it, self-supervised features outperform supervised baselines on some geometry and navigation tasks, remain competitive on object detection, and lag on semantic classification. The work argues for harder, more domain-aligned pretext tasks and standardized evaluation to drive progress in self-supervised learning.

Abstract

Self-supervised learning aims to learn representations from the data itself without explicit manual supervision. Existing efforts ignore a crucial aspect of self-supervised learning - the ability to scale to large amounts of data because self-supervision requires no manual labels. In this work, we revisit this principle and scale two popular self-supervised approaches to 100 million images. We show that by scaling on various axes (including data size and problem 'hardness'), one can largely match or even exceed the performance of supervised pre-training on a variety of tasks such as object detection, surface normal estimation (3D) and visual navigation using reinforcement learning. Scaling these methods also provides many interesting insights into the limitations of current self-supervised techniques and evaluations. We conclude that current self-supervised methods are not 'hard' enough to take full advantage of large-scale data and do not seem to learn effective high-level semantic representations. We also introduce an extensive benchmark across 9 different datasets and tasks. We believe that such a benchmark along with comparable evaluation settings is necessary to make meaningful progress. Code is at: https://github.com/facebookresearch/fair_self_supervision_benchmark.

Paper Structure

This paper contains 59 sections, 7 figures, 32 tables.

Figures (7)

  • Figure 1: Scaling the Pre-training Data Size: The transfer learning performance of self-supervised methods on the VOC07 dataset for AlexNet and ResNet-50 as we vary the pre-training data size. We keep the problem complexity and data domain (different-sized subsets of YFCC-100M) fixed. More details are in the section on scaling the pre-training data size.
  • Figure 2: Scaling Problem Complexity: We evaluate transfer learning performance of the Jigsaw and Colorization approaches on the VOC07 dataset for both AlexNet and ResNet-50 as we vary the problem complexity. The pre-training data is fixed at YFCC-1M to isolate the effect of problem complexity (see the section on scaling problem complexity).
  • Figure 3: Scaling Data and Problem Complexity: We vary the pre-training data size and Jigsaw problem complexity for both AlexNet and ResNet-50 models. We pre-train on two datasets, ImageNet and YFCC, and evaluate transfer learning performance on the VOC07 dataset.
  • Figure 4: Relationship between pre-training and transfer domain: We vary the pre-training data domain (ImageNet-1k, ImageNet-22k, and subsets of YFCC-100M) and observe transfer performance on the VOC07 and Places205 classification tasks. The similarity between the pre-training and transfer task domains has a strong influence on transfer performance.
  • Figure 5: Low-shot Image Classification on the VOC07 and Places205 datasets using linear SVMs trained on the features from the best performing layer for ResNet-50. We vary the number of labeled examples (per class) used to train the classifier and report the performance on the test set. We show the mean and standard deviation across five runs (see the section on low-shot image classification).
  • ...and 2 more figures
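
The low-shot protocol summarized in the Figure 5 caption (sample a fixed number of labeled examples per class, train a classifier on frozen features, score it on the test set, and report the mean and standard deviation over five runs) can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the paper uses linear SVMs on ResNet-50 features, whereas this sketch swaps in a nearest-centroid classifier so that it runs on the standard library alone, and it uses synthetic single-label 2-D "features" instead of VOC07 or Places205. All function names (`sample_low_shot`, `fit_centroids`, `low_shot_eval`, `make_toy_features`) are hypothetical.

```python
import random
import statistics


def sample_low_shot(labels, k, rng):
    """Pick k training indices per class, uniformly at random without replacement."""
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    chosen = []
    for label in sorted(by_class):
        chosen.extend(rng.sample(by_class[label], k))
    return chosen


def fit_centroids(features, labels, indices):
    """Stand-in classifier (NOT the paper's linear SVM): one mean vector per class."""
    sums, counts = {}, {}
    for i in indices:
        vec, label = features[i], labels[i]
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def predict(centroids, vec):
    """Return the class whose centroid is closest in squared Euclidean distance."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(centroids[c], vec))
    return min(sorted(centroids), key=sq_dist)


def low_shot_eval(train_x, train_y, test_x, test_y, k, runs=5, seed=0):
    """Return (mean, std) of test accuracy over `runs` independent k-shot samples."""
    accuracies = []
    for run in range(runs):
        rng = random.Random(seed + run)  # a different labeled subset per run
        idx = sample_low_shot(train_y, k, rng)
        centroids = fit_centroids(train_x, train_y, idx)
        correct = sum(predict(centroids, x) == y for x, y in zip(test_x, test_y))
        accuracies.append(correct / len(test_y))
    return statistics.mean(accuracies), statistics.stdev(accuracies)


def make_toy_features(n_per_class, rng):
    """Synthetic 2-D 'features' for two separated classes (illustration only)."""
    xs, ys = [], []
    for label, center in enumerate([(-2.0, -2.0), (2.0, 2.0)]):
        for _ in range(n_per_class):
            xs.append([rng.gauss(center[0], 1.0), rng.gauss(center[1], 1.0)])
            ys.append(label)
    return xs, ys


if __name__ == "__main__":
    data_rng = random.Random(0)
    train_x, train_y = make_toy_features(50, data_rng)
    test_x, test_y = make_toy_features(50, data_rng)
    for k in (1, 2, 4, 8):
        mean_acc, std_acc = low_shot_eval(train_x, train_y, test_x, test_y, k)
        print(f"k={k}: accuracy {mean_acc:.3f} +/- {std_acc:.3f}")
```

Each run draws its own labeled subset, so the standard deviation reflects sensitivity to which few examples were chosen, which is the quantity the error bars in a low-shot plot are meant to convey. To move closer to the setup the caption describes, one would replace `fit_centroids` and `predict` with a linear SVM and feed in features extracted from a pre-trained network.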