You Don't Need Domain-Specific Data Augmentations When Scaling Self-Supervised Learning

Théo Moutakanni, Maxime Oquab, Marc Szafraniec, Maria Vakalopoulou, Piotr Bojanowski

TL;DR

The paper investigates whether hand-crafted data augmentations are necessary for self-supervised learning with joint-embedding architectures (JEAs) at scale. By re-running DINOv2 under varied augmentation regimes and across large pretraining datasets, it shows that dataset size and distribution primarily drive performance, and that cropping without resizing can suffice when data and compute are ample. The authors demonstrate a near-state-of-the-art result without traditional augmentations, highlight the nuanced role of scaling laws, and show that augmentation effects diminish with larger data budgets. These findings challenge the prevailing assumption that invariance learned via augmentations is fundamental to JEAs and suggest a broader, less augmentation-dependent path for SSL research and applications.

Abstract

Self-supervised learning (SSL) with Joint-Embedding Architectures (JEAs) has led to outstanding performance. All instantiations of this paradigm were trained using strong and well-established hand-crafted data augmentations, leading to the general belief that they are required for the proper training and performance of such models. On the other hand, generative reconstruction-based models such as BEiT and MAE, or Joint-Embedding Predictive Architectures such as I-JEPA, have shown strong performance without using data augmentations except masking. In this work, we challenge the importance of invariance and data augmentation in JEAs at scale. By running a case study on a recent SSL foundation model, DINOv2, we show that strong image representations can be obtained with JEAs and only cropping without resizing, provided the training data is large enough, reaching state-of-the-art results while using the least amount of augmentation in the literature. Through this study, we also discuss the impact of compute constraints on the outcomes of experimental deep learning research, showing that they can lead to very different conclusions.

Paper Structure (13 sections, 2 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: Top: Visual description of pretraining losses. In blue: the local-to-global DINO loss; in red: the global-to-global DINO loss; in green: the latent masked token prediction (iBOT) loss. Bottom: Our different augmentation strategies. 'Original' uses several augmentations (RandomResizedCrop, ColorJitter, RandomGrayscale, GaussianBlur, RandomHorizontalFlip and RandomSolarize); 'Shared' uses the same augmentations but shares them between each view of the same image obtained with RandomResizedCrop. The 'Crop + Resize' setting uses only RandomResizedCrop. We also introduce a 'Crop' setup, which uses RandomCrop without random rescaling and is visually similar to 'Crop + Resize'.
  • Figure 2: Impact of dataset size when varying data augmentations. Results of ViT-L on linear evaluation benchmarks, including classification (ImageNet1k, Places205 and iNaturalist18), depth estimation (NYU-Depth) and segmentation (ADE20k). Cropping without resizing ('Crop') reaches very high performance on a wide variety of benchmarks, provided that the dataset is large enough.
  • Figure 3: Impact of data augmentations when we scale the number of training epochs (left) or the ViT architecture size (right) on the accuracy of a linear probe on ImageNet1k for a ViT-L when pretraining on ImageNet1k, ImageNet22k and LVD-142M. The 'Original' and 'Shared' setups scale with the number of epochs for all datasets, but the 'Crop' and 'Crop + Resize' setups only scale with larger datasets.
  • Figure 4: (left): Impact of the compute budget targeted by hyperparameter optimization on the accuracy of a linear probe on ImageNet1k and ADE20k for models trained on ImageNet1k. Optimizing for high compute leads to poor performance on the 'Crop' and 'Crop + Resize' settings, which is the opposite of our findings when we optimize for low compute. (right): Impact of hyperparameter tuning using 'Crop + Resize' on ImageNet1k for 100 epochs. For each line, we switch only one hyperparameter from the configuration optimized on ImageNet22k for 500 epochs (High compute) to the one optimized on ImageNet1k for 100 epochs (Low compute).
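
To make the difference between the 'Crop' and 'Crop + Resize' settings in Figure 1 concrete, the sketch below samples crop windows for each. It is an illustration under stated assumptions, not the authors' pipeline: the function names and the `CropBox` type are hypothetical, the scale and aspect-ratio ranges are the torchvision `RandomResizedCrop` defaults rather than the values used in the paper, and the fallback is simplified to a centered square.

```python
import math
import random
from typing import NamedTuple, Tuple


class CropBox(NamedTuple):
    """Hypothetical helper type: a crop window in pixel coordinates."""
    top: int
    left: int
    height: int
    width: int


def sample_crop(img_h: int, img_w: int, size: int, rng: random.Random) -> CropBox:
    """'Crop' setting: a fixed-size window at a random position.

    The window is used as-is, so object scale in pixels is unchanged.
    """
    if size > img_h or size > img_w:
        raise ValueError("crop size must fit inside the image")
    top = rng.randint(0, img_h - size)
    left = rng.randint(0, img_w - size)
    return CropBox(top, left, size, size)


def sample_resized_crop(
    img_h: int,
    img_w: int,
    rng: random.Random,
    scale: Tuple[float, float] = (0.08, 1.0),     # assumed: illustrative range
    ratio: Tuple[float, float] = (3 / 4, 4 / 3),  # assumed: illustrative range
    attempts: int = 10,
) -> CropBox:
    """'Crop + Resize' setting: a window with random area and aspect ratio.

    The caller would then resize this window to a fixed output size,
    which is what changes the apparent scale of the content.
    """
    area = img_h * img_w
    log_lo, log_hi = math.log(ratio[0]), math.log(ratio[1])
    for _ in range(attempts):
        target_area = area * rng.uniform(*scale)
        aspect = math.exp(rng.uniform(log_lo, log_hi))  # width / height
        w = round(math.sqrt(target_area * aspect))
        h = round(math.sqrt(target_area / aspect))
        if 0 < w <= img_w and 0 < h <= img_h:
            top = rng.randint(0, img_h - h)
            left = rng.randint(0, img_w - w)
            return CropBox(top, left, h, w)
    # Simplified fallback (an assumption of this sketch): centered square.
    side = min(img_h, img_w)
    return CropBox((img_h - side) // 2, (img_w - side) // 2, side, side)


if __name__ == "__main__":
    rng = random.Random(0)
    print("Crop:         ", sample_crop(300, 400, 224, rng))
    print("Crop + Resize:", sample_resized_crop(300, 400, rng))
```

In this sketch the 'Crop' window always has the same pixel size, whereas the 'Crop + Resize' window varies in area and shape before being rescaled, so only the latter exposes the model to random changes of scale.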