Effective pruning of web-scale datasets based on complexity of concept clusters

Amro Abbas; Evgenia Rusak; Kushal Tirumala; Wieland Brendel; Kamalika Chaudhuri; Ari S. Morcos

TL;DR

Filtering the LAION dataset shows that training on a smaller set of high-quality data can yield higher performance at significantly lower training cost. Using a simple and intuitive complexity measure, the training cost is reduced to a quarter of that of regular training.

Abstract

Utilizing massive web-scale datasets has led to unprecedented performance gains in machine learning models, but also imposes outlandish compute requirements for their training. To improve training and data efficiency, we push the limits of pruning large-scale multimodal datasets for training CLIP-style models. Today's most effective pruning method on ImageNet clusters data samples into separate concepts according to their embedding and prunes away the most prototypical samples. We scale this approach to LAION and improve it by noting that the pruning rate should be concept-specific and adapted to the complexity of the concept. Using a simple and intuitive complexity measure, we are able to reduce the training cost to a quarter of regular training. By filtering from the LAION dataset, we find that training on a smaller set of high-quality data can lead to higher performance with significantly lower training costs. More specifically, we outperform the LAION-trained OpenCLIP-ViT-B/32 model on ImageNet zero-shot accuracy by 1.1 p.p. while using only 27.7% of the data and training compute. Despite this strong reduction in training cost, we also see improvements on ImageNet distribution shifts, retrieval tasks, and VTAB. On the DataComp Medium benchmark, we achieve a new state-of-the-art ImageNet zero-shot accuracy and a competitive average zero-shot accuracy on 38 evaluation tasks.
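The idea of a concept-specific pruning rate can be illustrated with a minimal sketch. The abstract only states that the pruning rate is adapted to each concept's complexity; the softmax weighting, the `temperature` parameter, the capping loop, and the function name `allocate_budget` below are illustrative assumptions, not the authors' stated procedure.

```python
import math

def allocate_budget(complexity, sizes, total_keep, temperature=0.1):
    """Split total_keep across clusters by softmax(complexity / temperature).

    Hypothetical helper: clusters with higher complexity receive a larger
    share, and no cluster is asked for more samples than it contains.
    """
    if total_keep > sum(sizes):
        raise ValueError("cannot keep more samples than exist")
    keep = [0] * len(sizes)
    free = list(range(len(sizes)))  # clusters not yet capped at their size
    budget = total_keep
    while free:
        top = max(complexity[j] for j in free)
        weights = {j: math.exp((complexity[j] - top) / temperature) for j in free}
        z = sum(weights.values())
        target = {j: budget * weights[j] / z for j in free}
        capped = [j for j in free if target[j] >= sizes[j]]
        if not capped:
            for j in free:
                keep[j] = int(target[j])  # round down; leftovers handled below
            break
        # Saturated clusters keep everything; their surplus is redistributed.
        for j in capped:
            keep[j] = sizes[j]
            budget -= sizes[j]
        free = [j for j in free if j not in capped]
    # Hand out samples lost to rounding, most complex clusters first.
    leftover = total_keep - sum(keep)
    for j in sorted(range(len(sizes)), key=lambda j: -complexity[j]):
        extra = min(leftover, sizes[j] - keep[j])
        keep[j] += extra
        leftover -= extra
    return keep

# Toy example: three concept clusters with made-up complexity scores.
budget = allocate_budget([0.9, 0.5, 0.1], [100, 1000, 1000], 600, temperature=0.2)
print(budget)
```

In this toy setting the small but complex cluster is kept in full, and the remaining budget flows mostly to the second-most complex cluster. A lower temperature concentrates the budget more strongly on complex clusters; a higher one approaches uniform allocation.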

Paper Structure (44 sections, 3 equations, 8 figures, 9 tables)

Introduction
Related Work
Data curation in supervised learning
Contrastive Image-Language Pretraining
Data curation at scale
Redundancy Reduction
Matching Score Filtering
Improving the data quality
Methods
Deduplication
CLIP-score filtering
Density-Based Pruning (DBP)
Experiment Design
Training Datasets
Pruning the LAION dataset
...and 29 more sections

Figures (8)

Figure 1: With our approach, we outperform training on the full LAION-400M dataset (64.1% vs 63.0%) for CLIP-ViT-B/32 models while significantly reducing the training cost to 27.7%. We filter from the LAION-CAT-440M by first deduplicating it to 277M examples using the SemDeDup method and then applying Density-Based Pruning (DBP) to get datasets of sizes 84M, 112M, and 166M examples.
Figure 2: We determine the complexity of concepts within a dataset by examining the clusters in the embedding space of a pretrained model. We characterize the clusters with their inter-cluster (left) and intra-cluster distance (middle). We find that clusters with small inter-cluster distance tend to show similar concepts and have low variability among each other. Further, we observe that dense clusters show higher similarity among their samples. Thus, to obtain a more diverse dataset with high variability and low redundancy, we need to sample more from clusters with high inter-cluster distance and high intra-cluster distance. The scatter plot (right) shows the distribution of $\mathrm{d_{intra}}$ over $\mathrm{d_{inter}}$ on LAION-50M for 500 clusters.
Figure 3: CLIP-ViT-B/32 zero-shot evaluation for filtering the LAION-CAT-440M dataset (Radenovic et al., 2023). We filter the data by first deduplicating it to 277M examples to get LAION-DeDup-280M (SemDeDup in the figure). Then we apply the DBP method to filter the LAION-DeDup-280M dataset. We outperform training on the whole LAION-CAT-440M dataset on ImageNet, VTAB, and ImageNet distribution-shift datasets while using only 27%-41% of the training cost. For the LAION-CAT-440M baseline (green line), we train for 12.7B examples seen, following the OpenAI CLIP training procedure (Radford et al., 2021). For all other models, we train for 32 epochs regardless of the dataset size. The y-axis shows the training cost and the number of examples seen for each individual model. See Table \ref{tab:datacomp_38_results_2} for performance details on individual datasets.
Figure 4: (left) Performance grows consistently with continued training, and we close the gap to training on the full LAION-50M dataset when training for 45 epochs, despite using only 30M samples. We also outperform LAION CLIP-B/16 score (CS) filtering. (right) Density-Based Pruning (DBP) improves performance over SSP-Pruning (Sorscher et al., 2022). We prune the LAION-50M dataset to 30M examples and train CLIP-B/32 on it for five epochs.
Figure 5: The choice of the encoder as well as the data modality are important hyperparameters.
...and 3 more figures
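The cluster statistics named in the Figure 2 caption can be sketched in a few lines. The caption does not give exact formulas, so the definitions below are assumptions: $\mathrm{d_{intra}}$ is taken as the mean cosine distance of a cluster's samples to its centroid, and $\mathrm{d_{inter}}$ as the mean cosine distance from a centroid to its nearest neighboring centroids. The function names and the `num_neighbors` parameter are hypothetical.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def centroid(points):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]

def cluster_distances(clusters, num_neighbors=1):
    """Compute (d_intra, d_inter) per cluster under the assumed definitions."""
    centroids = [centroid(c) for c in clusters]
    # d_intra: how spread out the samples are around their own centroid.
    d_intra = [
        sum(cosine_distance(p, mu) for p in c) / len(c)
        for c, mu in zip(clusters, centroids)
    ]
    # d_inter: how far the centroid is from its closest neighboring centroids.
    d_inter = []
    for j, mu in enumerate(centroids):
        others = sorted(
            cosine_distance(mu, nu) for k, nu in enumerate(centroids) if k != j
        )
        nearest = others[:num_neighbors]
        d_inter.append(sum(nearest) / len(nearest))
    return d_intra, d_inter

# Toy 2-D "embeddings": two tight, neighboring clusters and one spread-out,
# isolated cluster.
clusters = [
    [[1.0, 0.0], [0.99, 0.05]],
    [[0.9, 0.3], [0.8, 0.5]],
    [[0.0, 1.0], [-0.5, 0.8], [0.4, 0.9]],
]
d_intra, d_inter = cluster_distances(clusters)
print(d_intra, d_inter)
```

Under these assumed definitions, the third toy cluster has both the largest intra-cluster and the largest inter-cluster distance, so it is the one the caption's reasoning would sample from most heavily.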
