On Pretraining Data Diversity for Self-Supervised Learning

Hasan Abed Al Kader Hammoud; Tuhin Das; Fabio Pizzati; Philip Torr; Adel Bibi; Bernard Ghanem

TL;DR

The paper investigates how pretraining data diversity affects self-supervised learning (SSL) under a fixed compute budget, and finds that diversity improves performance primarily when the downstream task distribution closely matches the pretraining distribution. The authors use a compute-normalized framework with budget $\mathcal{C}=N\cdot\mathcal{E}$ and diversity $\mathcal{D}=N/\mathcal{C}$, where $N$ is the number of unique samples and $\mathcal{E}$ the number of epochs. Within this framework, increasing diversity helps in-distribution transfer but cannot fully bridge distribution shifts, even with extremely large and diverse datasets such as YFCC100M. At large scale, ImageNet-pretrained SSL models generally outperform YFCC100M-pretrained ones under the normalized budget, and distribution-distance metrics (FID, VisualDNA) align with the transfer outcomes; scaling data alone does not compensate for domain gaps. The work argues for compute-normalized evaluation and distribution-aware data collection when assessing SSL, and suggests that future SSL methods must make better use of diversity to generalize across shifting downstream distributions.
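To make the budget and diversity definitions concrete, here is a minimal Python sketch of the arithmetic implied by $\mathcal{C}=N\cdot\mathcal{E}$ and $\mathcal{D}=N/\mathcal{C}$. The function names are illustrative and not taken from the authors' code. The baseline numbers (65,000 images, a budget of 50 million) come from the Figure 3 caption, while the doubled dataset size is a hypothetical value chosen only to show the trade-off.

```python
def epochs_for_budget(num_unique: int, budget: int) -> float:
    """Epochs E allowed by a fixed budget C = N * E, i.e. E = C / N."""
    if num_unique <= 0 or budget <= 0:
        raise ValueError("num_unique and budget must be positive")
    return budget / num_unique


def diversity(num_unique: int, budget: int) -> float:
    """Pretraining diversity D = N / C (equivalently 1 / E)."""
    if num_unique <= 0 or budget <= 0:
        raise ValueError("num_unique and budget must be positive")
    return num_unique / budget


if __name__ == "__main__":
    budget = 50_000_000  # C from the Figure 3 caption
    # 65_000 is the baseline set size from Figure 3;
    # 130_000 is a hypothetical doubled dataset for illustration.
    for n in (65_000, 130_000):
        e = epochs_for_budget(n, budget)
        d = diversity(n, budget)
        print(f"N={n:>7}  E={e:8.2f}  D={d:.6f}")
```

Under a fixed budget, doubling the number of unique samples halves the number of epochs, so each image is seen half as often. This is the trade-off that compute normalization makes explicit.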

Abstract

We explore the impact of training with more diverse datasets, characterized by the number of unique samples, on the performance of self-supervised learning (SSL) under a fixed computational budget. Our findings consistently demonstrate that increasing pretraining data diversity enhances SSL performance, albeit only when the distribution distance to the downstream data is minimal. Notably, even with an exceptionally large pretraining data diversity, achieved through methods such as web crawling or diffusion-generated data, the distribution shift remains a challenge. Our experiments are comprehensive, covering seven SSL methods on large-scale datasets such as ImageNet and YFCC100M and amounting to over 200 GPU days. Code and trained models are available at https://github.com/hammoudhasan/DiversitySSL

Paper Structure (26 sections, 2 equations, 5 figures, 11 tables)

Introduction
Related Work
Preliminaries
Normalized Evaluation
Fixed Budget SSL: In & Out-of-Distribution
Performance on the Same Distribution
Increasing Data Diversity
Scaling Pretraining Diversity
Additional Analysis
Discussion
Importance of Normalization of Computation
Increasing Total Computation
Epoch-based normalization
Alternative Settings
Non-Contrastive Methods
...and 11 more sections

Figures (5)

Figure 1: Impact of Pretraining Diversity: Self-supervised learning (SSL) can be used to pretrain vision models on small datasets closely aligned to the downstream task, e.g., pets classification, hence with a small distribution shift (top, wild animals pretraining). Conversely, we could pretrain on an extensively varied dataset, with wide distribution differences (outdoor scenes, bottom). We study the role of pretraining diversity in SSL under a fixed budget, and highlight its effects in relationship to the distribution shift.
Figure 2: Data Collection Strategies: We analyze strategies for collecting additional data ($\mathbb{A}$), i.e., collecting more source data, crawling the web or using synthetic images. Using a class prior (top row) simulates In-distribution trainings. We also collect images without class prior (bottom row) to analyze the interactions between diversity and Out-of-distribution classes.
Figure 3: Effect of Various Data Sources on SSL Pretraining: We use a baseline set $\mathbb{B}$ (black dashed line), comprising $65\times 10^3$ images from ImageNet-100, for pretraining a ResNet-18 with $\mathcal{C}=50\times10^6$. Augmenting $\mathbb{B}$ with In-distribution images enhances performance (above black line), while Out-of-distribution augmentations reduce it (below black line).
Figure 4: Data Diversity Impact on YFCC100M Pretraining Performance: Pretraining ($\mathcal{C}=98\times10^6$) of networks with $\mathbb{D}_\text{SSL}=\text{YFCC100M}$ and $\mathbb{D}_\text{task}=\text{ImageNet}$ for several dataset subsets. In the presence of a distribution shift, performance tends to saturate and does not benefit from additional data diversity.
Figure 5: Impact of Epoch Normalization on SSL Pretraining Performance: This figure contrasts an epoch-normalized baseline (red line) with the trained methods from the main paper (Figure 3). Under epoch normalization, the findings are reversed: more diverse trainings, irrespective of their origin (source, web, or synthetic) and label distribution (in- or out-of-distribution), consistently enhance performance. This is an unfair comparison, because each augmented pretraining costs more compute when epochs are normalized. It illustrates how an alternative normalization can lead to wrong conclusions compared to compute normalization. The DINO $\mathbb{B}$ epoch-normalized baseline is shown in text only (Acc. 41.14) for ease of visualization.
