Automatic Data Curation for Self-Supervised Learning: A Clustering-Based Approach

Huy V. Vo; Vasil Khalidov; Timothée Darcet; Théo Moutakanni; Nikita Smetanin; Marc Szafraniec; Hugo Touvron; Camille Couprie; Maxime Oquab; Armand Joulin; Hervé Jégou; Patrick Labatut; Piotr Bojanowski

TL;DR

This work tackles data quality in self-supervised learning (SSL) by introducing a principled automatic data curation pipeline based on hierarchical $k$-means with resampling, aimed at producing large, diverse, and balanced datasets. The method samples data so as to approximate a uniform distribution over concepts on the support of the data in embedding space, which counteracts the long-tail biases that hinder SSL. Across web images, text, and satellite imagery, SSL features trained on curated data outperform those trained on raw data and, in many cases, rival those trained on manually curated datasets, with pronounced gains on robustness, out-of-distribution, and long-tailed benchmarks. The approach is scalable, domain-agnostic, and applicable beyond SSL, offering practical benefits for large-scale data-driven learning while highlighting areas for further scaling and fairness considerations.

Abstract

Self-supervised features are the cornerstone of modern machine learning systems. They are typically pre-trained on data collections whose construction and curation require extensive human effort. This manual process has limitations similar to those encountered in supervised learning: for example, crowd-sourced data selection is costly and time-consuming, which prevents scaling the dataset size. In this work, we consider the problem of automatic curation of high-quality datasets for self-supervised pre-training. We posit that such datasets should be large, diverse and balanced, and propose a clustering-based approach for building datasets that satisfy all three criteria. Our method involves successive and hierarchical applications of $k$-means on a large and diverse data repository to obtain clusters that distribute uniformly among data concepts, followed by a hierarchical, balanced sampling step from these clusters. Extensive experiments on three different data domains (web-based images, satellite images and text) show that features trained on our automatically curated datasets outperform those trained on uncurated data while being on par with or better than those trained on manually curated data. Code is available at https://github.com/facebookresearch/ssl-data-curation.
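The two stages described in the abstract (hierarchical $k$-means, then balanced sampling from the resulting clusters) can be illustrated with a small NumPy sketch. This is a simplified two-level illustration under stated assumptions, not the code from the linked repository: the function names, the farthest-first initialization, the equal-budget allocation rule, and the toy data are all hypothetical choices made here, and the resampling steps of the full method are omitted.

```python
import numpy as np


def farthest_first_init(points, k, rng):
    """Pick k starting centroids: one random point, then repeatedly the point
    farthest from the centroids chosen so far (an illustrative init choice)."""
    chosen = [int(rng.integers(len(points)))]
    min_dist = ((points - points[chosen[0]]) ** 2).sum(axis=1)
    for _ in range(k - 1):
        nxt = int(min_dist.argmax())
        chosen.append(nxt)
        min_dist = np.minimum(min_dist, ((points - points[nxt]) ** 2).sum(axis=1))
    return points[chosen].astype(float)


def assign(points, centroids):
    """Index of the nearest centroid (squared Euclidean distance) per point."""
    dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)


def kmeans(points, k, rng, n_iter=20):
    """Plain Lloyd iterations; returns (centroids, labels)."""
    centroids = farthest_first_init(points, k, rng)
    for _ in range(n_iter):
        labels = assign(points, centroids)
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # leave an empty cluster's centroid in place
                centroids[j] = members.mean(axis=0)
    return centroids, assign(points, centroids)


def hierarchical_balanced_sample(points, k1, k2, per_top_cluster, seed=0):
    """Two-level sketch: cluster the points, cluster the resulting centroids,
    then give every top-level cluster the same sampling budget and split it
    evenly among its non-empty child clusters (hypothetical allocation rule)."""
    rng = np.random.default_rng(seed)
    c1, l1 = kmeans(points, k1, rng)  # level 1: cluster the data points
    _, l2 = kmeans(c1, k2, rng)       # level 2: cluster the level-1 centroids
    selected = []
    for top in range(k2):
        children = [c for c in np.flatnonzero(l2 == top) if np.any(l1 == c)]
        if not children:
            continue
        budget = max(1, per_top_cluster // len(children))
        for child in children:
            members = np.flatnonzero(l1 == child)
            take = min(budget, len(members))  # small clusters are taken whole
            selected.extend(rng.choice(members, size=take, replace=False).tolist())
    return np.array(sorted(selected), dtype=int)


# Toy long-tailed pool: a large "head" concept and a small, distant "tail" concept.
data_rng = np.random.default_rng(0)
head = data_rng.normal(0.0, 1.0, size=(5000, 2))
tail = data_rng.normal(20.0, 1.0, size=(100, 2))
points = np.vstack([head, tail])
is_tail = np.arange(len(points)) >= len(head)

idx = hierarchical_balanced_sample(points, k1=20, k2=2, per_top_cluster=50)
print("tail share in pool:  ", is_tail.mean())
print("tail share in sample:", is_tail[idx].mean())
```

In this toy setting the rare concept should occupy a much larger share of the sample than of the pool, which is the rebalancing effect the abstract describes. A realistic pipeline would operate on learned embeddings and use an approximate, large-scale $k$-means rather than this dense implementation.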

Paper Structure (31 sections, 1 theorem, 2 equations, 5 figures, 7 tables, 1 algorithm)

Introduction
Related work
Approach
A Criterion for Creating Pre-training Datasets
Problem statement.
Rebalancing datasets with $k$-means
Rebalancing datasets with hierarchical $k$-means
Sampling from hierarchical $k$-means.
Choice of numbers of clusters.
Experiments
Experiments on simulated data
Self-supervised learning on web-based images
Training data, implementation details, and evaluations
Ablation Study
Does hierarchical $k$-means lead to more balanced clusterings?
...and 16 more sections

Key Result

Lemma 1

Let $P$ be the probability distribution with density function $p$, $t$ a scalar in $(0, 1)$, $Q$ the probability distribution with density $q=\frac{1}{Z} p^t$ where $Z=\int p^t$, and $U$ the uniform probability distribution over the support $\Omega$ of $P$ with density $u = \frac{1}{\mathrm{vol}(\Omega)} \mathbb{1}_\Omega$. Then $D_\text{KL}(Q \,\|\, U) \leq D_\text{KL}(P \,\|\, U)$, where $D_\text{KL}$ denotes the Kullback-Leibler divergence. Furthermore, equality holds if and only if $P = U$.
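The statement can be sanity-checked numerically on a discrete analogue, where tempering a probability vector by an exponent $t \in (0, 1)$ should not move it farther from uniform in Kullback-Leibler divergence. The sketch below is only an illustration under that discrete assumption; the helper names `kl_to_uniform` and `temper` and the example vector are hypothetical, and a numerical check is of course not a proof.

```python
import numpy as np


def kl_to_uniform(prob):
    """D_KL(prob || uniform) over a finite support, with 0 * log 0 taken as 0."""
    prob = np.asarray(prob, dtype=float)
    n = prob.size
    nz = prob > 0
    return float(np.sum(prob[nz] * np.log(prob[nz] * n)))


def temper(prob, t):
    """Discrete counterpart of q = p^t / Z."""
    q = np.asarray(prob, dtype=float) ** t
    return q / q.sum()


# A skewed, long-tailed distribution over four "concepts" (made-up numbers).
p = np.array([0.70, 0.20, 0.07, 0.03])
for t in (0.25, 0.5, 0.75):
    q = temper(p, t)
    print(f"t={t}: KL(Q||U)={kl_to_uniform(q):.4f}  KL(P||U)={kl_to_uniform(p):.4f}")
```

Smaller values of $t$ flatten the distribution more strongly, so the divergence to uniform is expected to shrink as $t$ decreases, and a distribution that is already uniform is left unchanged.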

Figures (5)

Figure 1: An overview of the data curation pipeline. Large data pools often exhibit a long-tailed distribution of concepts. In web-based image repositories, concepts such as website or dog are far more frequent than plunger. We apply hierarchical $k$-means to obtain clusters that spread uniformly over the concepts. Data points are then sampled from the clusters to form a curated dataset that has a better balance of concepts.
Figure 2: Normalized histograms of centroids computed by $k$-means with $d(x,y) = \|x-y\|^s$ for different values of $s$. The vanilla $k$-means centroids ($s=2$) approximately follow the theoretical Panter and Dite formula with un-normalized density $p^{1/3}$ (Panter and Dite, 1951), where $p$ is the density of the data distribution. Larger values of $s$ result in flatter distributions of centroids.
Figure 3: A visualization of clusters obtained with different clustering methods on simulated 2-dimensional data. (a-b) Voronoi diagrams and KDEs computed on the 2-D simulated data and the centroids of clusters obtained with $k$-means, DBSCAN (Ester et al., 1996), agglomerative clustering (Sibson, 1973) and several variants of hierarchical $k$-means. For hierarchical $k$-means, centroids spread more uniformly with more levels and resampling steps. (c) Estimated Kullback-Leibler divergence between the uniform distribution on $\Omega = [-3,3] \times [-3,3]$ and the KDEs computed from the centroids.
Figure 4: The distribution of clusters of web-based images over the classes of ImageNet. The clusters are obtained with variants of hierarchical $k$-means on our data pool. For each clustering, we first assign clusters to ImageNet classes with $k$-nn, then estimate, for each class, its size and the number and average size of its clusters. We plot class size against the number of corresponding clusters in the first row, and against the average cluster size in the second row. The straight lines that best fit the scatter points are shown in yellow. We observe that $k$-means tends to break larger classes into many small clusters, while hierarchical $k$-means with multiple levels forms fewer but larger clusters for large classes. It thus distributes the clusters more evenly among classes, regardless of their size, and enables sampling a more balanced dataset from the data pool.
Figure 5: Hierarchy of clusters obtained when applying our proposed hierarchical $k$-means on web-based images. We show here clusters in levels $1$, $2$ and $3$ representing food, motorbike and bedroom concepts. Red rectangles show clusters in level 1 with $2$ representative images.
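
As a side note on Figure 2, the flattening of the centroid histograms with larger $s$ is what classical high-resolution quantization theory would suggest in one dimension. The relation below is a standard asymptotic result recalled from that literature rather than a formula stated on this page, and $\lambda_s$ is a name introduced here for the limiting density of centroids.

```latex
\lambda_s(x) \;\propto\; p(x)^{\frac{1}{1+s}},
\qquad\text{so}\qquad
\lambda_2(x) \;\propto\; p(x)^{1/3}
\quad\text{and}\quad
\frac{1}{1+s} \to 0 \ \text{ as } s \to \infty .
```

Setting $s=2$ recovers the exponent $1/3$ quoted in the caption, and as the exponent approaches zero the centroid density approaches a constant on the support of $p$.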
