SELECT: A Large-Scale Benchmark of Data Curation Strategies for Image Classification

Benjamin Feuer, Jiawei Xu, Niv Cohen, Patrick Yubeaton, Govind Mittal, Chinmay Hegde

TL;DR

This work takes steps toward a formal evaluation of data curation strategies. It introduces SELECT, the first large-scale benchmark of curation strategies for image classification, and ImageNet++, a new dataset that constitutes the largest superset of ImageNet-1K to date.

Abstract

Data curation is the problem of how to collect and organize samples into a dataset that supports efficient learning. Despite the centrality of the task, little work has been devoted to a large-scale, systematic comparison of curation methods. In this work, we take steps toward a formal evaluation of data curation strategies and introduce SELECT, the first large-scale benchmark of curation strategies for image classification. To generate baseline methods for the SELECT benchmark, we create a new dataset, ImageNet++, which constitutes the largest superset of ImageNet-1K to date. Our dataset extends ImageNet with 5 new training-data shifts, each approximately the size of ImageNet-1K itself, and each assembled using a distinct curation strategy. We evaluate our data curation baselines in two ways: (i) using each training-data shift to train identical image classification models from scratch, and (ii) using the data itself to fit a pretrained self-supervised representation. Our findings reveal trends for recent data curation methods in particular, such as synthetic data generation and lookup based on CLIP embeddings. We show that although these strategies are highly competitive for certain tasks, the curation strategy used to assemble the original ImageNet-1K dataset remains the gold standard. We anticipate that our benchmark can illuminate the path for new methods to further reduce this gap. We release our checkpoints, code, documentation, and a link to our dataset at https://github.com/jimmyxu123/SELECT.

Paper Structure (30 sections, 5 equations, 4 figures, 9 tables)

Figures (4)

  • Figure 1: Overview of the SELECT benchmark. (Left) The ImageNet++ dataset is composed of different shifts of the ImageNet train set. The shifts were generated using different curation strategies and drawn from diverse data sources, including OpenImages (natural images), LAION (natural images), and Stable Diffusion (synthetic images). (Right) We trained identical models on the sets collected using the different strategies (producing different 'shifts') and evaluated them in two ways: (i) Utility metrics, which quantify the models' ability to predict on different in-distribution and out-of-distribution test sets, and (ii) Analytic metrics, which examine various statistics of the distribution of samples among the classes.
  • Figure 2: Samples selected from the ImageNet++ dataset compared to those in the original ImageNet-1K dataset. The selected classes are "Volcano", "School Bus", "Umbrella" and "Dogsled". Different viewpoints and centers emerge in these categories for LAION-1k and OI-1k. Samples generated in SD-1k are illustrations that may defy the laws of physics.
  • Figure 3: Illustration of Label Imbalance metrics on a dataset with 1.2 million samples and 1000 classes. (Left) An approximately uniform data distribution. (Middle) 5% of classes hold 50% of samples. (Right) Most classes are underrepresented, with fewer than 100 samples (y-axis is log-scaled).
  • Figure 4: The performance of various examined data curation strategies across different ImageNet evaluation sets. Each color on the radial plot represents a different data curation strategy (Tab. \ref{tab:curatestrat}). Each direction on the plot corresponds to a distinct ImageNet evaluation set (see Sec. \ref{sec:select}). All values are normalized based on the performance achieved with the original ImageNet training set, set as $1.0$. The radial direction indicates values ranging from $0.0$ to $0.9$.
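
The label-imbalance scenarios in the Figure 3 caption can be illustrated with a minimal Python sketch. It is not the paper's implementation: this excerpt does not define the benchmark's Label Imbalance metrics, so the function names (`class_counts`, `fraction_of_classes_covering`, `num_underrepresented`) and the toy class counts are hypothetical. The sketch only computes the two quantities the caption mentions: the smallest fraction of classes holding a given share of samples, and the number of classes below a sample threshold.

```python
from collections import Counter


def class_counts(labels, num_classes):
    """Samples per class, including classes that received zero samples."""
    counts = Counter(labels)
    return [counts.get(c, 0) for c in range(num_classes)]


def fraction_of_classes_covering(counts, sample_share=0.5):
    """Smallest fraction of classes (largest first) that together hold
    at least `sample_share` of all samples."""
    total = sum(counts)
    if total == 0:
        raise ValueError("dataset is empty")
    running = 0
    for k, n in enumerate(sorted(counts, reverse=True), start=1):
        running += n
        if running >= sample_share * total:
            return k / len(counts)
    return 1.0


def num_underrepresented(counts, threshold=100):
    """Number of classes with fewer than `threshold` samples."""
    return sum(1 for n in counts if n < threshold)


# Hypothetical toy data (100 classes), not drawn from ImageNet++.
uniform = [190] * 100               # every class has the same size
skewed = [1900] * 5 + [100] * 95    # 5% of classes hold 50% of samples
long_tail = [1900] * 5 + [99] * 95  # most classes fall below 100 samples

for name, counts in [("uniform", uniform), ("skewed", skewed), ("long_tail", long_tail)]:
    print(name, fraction_of_classes_covering(counts), num_underrepresented(counts))
```

For a real shift, `class_counts` would be applied to the integer class labels of its training set, with `num_classes=1000` for the ImageNet-1K label space.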