On Pretraining Data Diversity for Self-Supervised Learning
Hasan Abed Al Kader Hammoud, Tuhin Das, Fabio Pizzati, Philip Torr, Adel Bibi, Bernard Ghanem
TL;DR
The paper investigates how pretraining data diversity affects self-supervised learning (SSL) under a fixed compute budget, and finds that diversity improves performance primarily when the downstream distribution closely matches the pretraining distribution. The authors use a compute-normalized framework with budget $\mathcal{C}=N\cdot\mathcal{E}$ and diversity $\mathcal{D}=N/\mathcal{C}$, where $N$ is the number of unique pretraining samples and $\mathcal{E}$ the number of epochs. Within this framework, increasing diversity helps in-distribution transfer but cannot fully bridge a distribution shift, even with extremely large and diverse datasets such as YFCC100M. In the large-scale experiments, SSL models pretrained on ImageNet generally outperform those pretrained on YFCC100M under the normalized budget, and distribution-distance metrics (FID, VisualDNA) align with the transfer outcomes; scaling data alone does not compensate for a domain gap. The work argues for compute-normalized evaluation and distribution-aware data collection when assessing SSL, and suggests that future SSL methods must make better use of diversity to generalize across shifting downstream distributions.
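To make the compute-normalized definitions concrete, the following minimal sketch evaluates $\mathcal{D}=N/\mathcal{C}$ and the implied number of epochs $\mathcal{E}=\mathcal{C}/N$ for a fixed budget. The function names and the numeric values are hypothetical and chosen only for illustration; they are not taken from the paper's code or experiments.

```python
def pretraining_diversity(num_unique_samples: int, budget: int) -> float:
    """Return D = N / C for N unique samples under a budget of C samples seen."""
    if num_unique_samples <= 0 or budget <= 0:
        raise ValueError("N and C must be positive")
    if num_unique_samples > budget:
        # With C = N * E and at least one epoch, N cannot exceed C.
        raise ValueError("N cannot exceed the budget C")
    return num_unique_samples / budget


def epochs_for_budget(num_unique_samples: int, budget: int) -> float:
    """Return E = C / N, the number of passes over the data that exhausts C."""
    if num_unique_samples <= 0 or budget <= 0:
        raise ValueError("N and C must be positive")
    return budget / num_unique_samples


# Hypothetical budget: one million samples seen in total (illustrative only).
BUDGET = 1_000_000

# Low diversity: few unique samples, each revisited many times.
low_d = pretraining_diversity(100_000, BUDGET)
low_e = epochs_for_budget(100_000, BUDGET)

# Maximum diversity: every sample seen during training is unique.
max_d = pretraining_diversity(1_000_000, BUDGET)
max_e = epochs_for_budget(1_000_000, BUDGET)

print(f"N=100k: D={low_d:.2f}, E={low_e:.0f}")
print(f"N=1M:   D={max_d:.2f}, E={max_e:.0f}")
```

Because $\mathcal{C}$ is held fixed, raising $N$ necessarily lowers $\mathcal{E}$, so under these definitions $\mathcal{D}=1/\mathcal{E}$: the comparison trades repeated passes over a small dataset against fewer passes over a larger one.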
Abstract
We explore the impact of training with more diverse datasets, characterized by the number of unique samples, on the performance of self-supervised learning (SSL) under a fixed computational budget. Our findings consistently demonstrate that increasing pretraining data diversity enhances SSL performance, albeit only when the distribution distance to the downstream data is minimal. Notably, even with exceptionally large pretraining data diversity, achieved through methods such as web crawling or diffusion-generated data, the distribution shift remains a challenge. Our experiments are comprehensive, covering seven SSL methods on large-scale datasets such as ImageNet and YFCC100M and amounting to over 200 GPU days. Code and trained models are available at https://github.com/hammoudhasan/DiversitySSL
