Zero-Shot Coreset Selection via Iterative Subspace Sampling

Brent A. Griffin, Jacob Marks, Jason J. Corso

TL;DR

The paper tackles the high cost of training on large labeled datasets by proposing a zero-shot coreset selection method for unlabeled data. ZCore uses embeddings from previously trained foundation models to build a high-dimensional representation of the unlabeled data, then iteratively samples low-dimensional embedding subspaces to score each example by coverage while penalizing redundancy. The result is a coreset selected without any labels or training on the candidate data. Across CIFAR10/100, ImageNet, and EuroSAT, ZCore achieves competitive or superior performance relative to state-of-the-art label-based methods, particularly at low data rates, while saving substantial runtime and labeling cost. The approach makes data-efficient deep learning practical for real-world unlabeled data and is reported to remain robust across diverse datasets and model architectures.
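The selection loop summarized above could be sketched roughly as follows. This is an illustrative approximation, not the authors' implementation (their public code is the reference for that). The function names `subspace_scores` and `select_coreset`, the uniform sampling of a target point in each subspace, the nearest-neighbor redundancy penalty, and every parameter value are assumptions made for this example.

```python
import numpy as np


def subspace_scores(embeddings, n_iters=500, subspace_dim=2,
                    n_neighbors=5, redundancy_weight=0.2, seed=0):
    """Score examples by coverage minus redundancy over random subspaces.

    Hypothetical sketch: the scoring rule here is an assumption, not the
    exact rule used by ZCore.
    """
    rng = np.random.default_rng(seed)
    n, d = embeddings.shape

    # Rescale every embedding dimension to [0, 1] so subspaces are comparable.
    lo, hi = embeddings.min(axis=0), embeddings.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    z = (embeddings - lo) / span

    k = min(n_neighbors, n - 1)
    scores = np.zeros(n)
    for _ in range(n_iters):
        # Slice a random low-dimensional subspace out of the full space.
        dims = rng.choice(d, size=min(subspace_dim, d), replace=False)
        sub = z[:, dims]

        # Sample a random location; the closest example "covers" it.
        target = rng.uniform(0.0, 1.0, size=sub.shape[1])
        winner = int(np.argmin(np.linalg.norm(sub - target, axis=1)))
        scores[winner] += 1.0

        # Penalize the winner's nearest neighbors in this subspace as redundant.
        dist = np.linalg.norm(sub - sub[winner], axis=1)
        dist[winner] = np.inf
        neighbors = np.argsort(dist)[:k]
        scores[neighbors] -= redundancy_weight
    return scores


def select_coreset(scores, budget):
    """Return the indices of the `budget` highest-scoring examples."""
    return np.argsort(-scores, kind="stable")[:budget]


# Usage with stand-in embeddings (random data, not real model outputs).
demo_embeddings = np.random.default_rng(1).normal(size=(200, 16))
demo_scores = subspace_scores(demo_embeddings)
demo_coreset = select_coreset(demo_scores, budget=20)
```

Because the scores are computed once for all candidates, the same ranking can be cut at any budget, which matches the summary's point that a coreset can be sized after scoring rather than recomputed per budget.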

Abstract

Deep learning increasingly relies on massive data with substantial storage, annotation, and training costs. To reduce costs, coreset selection finds a representative subset of data to train models while ideally performing on par with the full data training. To maximize performance, current state-of-the-art coreset methods select data using dataset-specific ground truth labels and training. However, these methodological requirements prevent selection at scale on real-world, unlabeled data. To that end, this paper addresses the selection of coresets that achieve state-of-the-art performance but without using any labels or training on candidate data. Instead, our solution, Zero-Shot Coreset Selection via Iterative Subspace Sampling (ZCore), uses previously-trained foundation models to generate zero-shot, high-dimensional embedding spaces to interpret unlabeled data. ZCore then iteratively quantifies the relative value of all candidate data based on coverage and redundancy in numerous subspace distributions. Finally, ZCore selects a coreset sized for any data budget to train downstream models. We evaluate ZCore on four datasets and outperform several state-of-the-art label-based methods, especially at low data rates that provide the most substantial cost reduction. On ImageNet, ZCore selections for 10% training data achieve a downstream validation accuracy of 53.99%, which outperforms prior label-based methods and removes annotation and training costs for 1.15 million images. Our paper's code is publicly available at https://github.com/voxel51/zcore.
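As a quick sanity check on the ImageNet figure, the arithmetic can be reproduced in a few lines. This assumes the abstract refers to the ImageNet-1k (ILSVRC 2012) training split of 1,281,167 images and that the 1.15 million images are the 90% left out of the coreset; both are assumptions of this example rather than statements from the abstract.

```python
# Assumption: ImageNet-1k (ILSVRC 2012) training split size.
IMAGENET_TRAIN_IMAGES = 1_281_167
CORESET_RATE = 0.10  # 10% data budget quoted in the abstract

selected = round(IMAGENET_TRAIN_IMAGES * CORESET_RATE)
unselected = IMAGENET_TRAIN_IMAGES - selected

print(f"selected for training: {selected:,}")
print(f"never labeled or trained on: {unselected:,} "
      f"(about {unselected / 1e6:.2f} million)")
```

Under those assumptions, roughly 128 thousand images are kept and about 1.15 million are never annotated or used for training, which is consistent with the number quoted above.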

Paper Structure

This paper contains 15 sections, 9 equations, 17 figures, 10 tables.

Figures (17)

  • Figure 1: Overview. We use foundation models to generate high-dimensional embeddings for unlabeled candidate images (left). We then iteratively slice the full embedding space into subspaces (center), which we sample to find examples covering large regions of the resulting subspace distributions while penalizing redundancy (right). Finally, we output a coreset of data to train models for a given budget (bottom).
  • Figure 2: Coreset and Model Train Workflow Comparison.
  • Figure 3: Comparison of embeddings and sampling techniques. The ResNet18 (left) and CLIP (right) panels show the first dimension of the model embeddings for 50,000 CIFAR100 train set examples, while each corresponding distribution type is sampled 50,000 times.
  • Figure 4: ZCore Coreset Rank Visualization for CIFAR100. Model embeddings and 2D approximation of the full embedding space generated using the FiftyOne Library and UMAP (moore2020fiftyone; mcinnes2018umap-software).
  • Figure 5: Comparison of coreset selection methods using downstream model validation on CIFAR10, CIFAR100, and ImageNet. Method results with solid lines select coreset data using labels and training, method results with dotted lines select coreset data using self-supervised training, and method results with dashed lines select coreset data without labels or training. The $x$-axis is in log scale for the number of coreset examples used for model training. Corresponding result tables for each dataset are provided in the Supplementary Material.
  • ...and 12 more figures