Zero-Shot Coreset Selection via Iterative Subspace Sampling
Brent A. Griffin, Jacob Marks, Jason J. Corso
TL;DR
The paper tackles the high cost of training on large labeled datasets by proposing ZCore, a zero-shot coreset selection method that operates on unlabeled data. ZCore uses foundation-model embeddings to build a high-dimensional representation of the unlabeled data, then iteratively samples low-dimensional embedding subspaces to score each candidate for coverage and penalize redundancy, producing a coreset without any data labeling or training on the candidate data. Across CIFAR10/100, ImageNet, and EuroSAT, ZCore matches or outperforms several state-of-the-art label-based methods, particularly at low data rates, while removing the annotation and training costs that those methods incur during selection. This makes data-efficient deep learning more practical, since selection can scale to real-world unlabeled data.
Abstract
Deep learning increasingly relies on massive data with substantial storage, annotation, and training costs. To reduce costs, coreset selection finds a representative subset of data to train models while ideally performing on par with training on the full data. To maximize performance, current state-of-the-art coreset methods select data using dataset-specific ground truth labels and training. However, these methodological requirements prevent selection at scale on real-world, unlabeled data. To that end, this paper addresses the selection of coresets that achieve state-of-the-art performance but without using any labels or training on candidate data. Instead, our solution, Zero-Shot Coreset Selection via Iterative Subspace Sampling (ZCore), uses previously trained foundation models to generate zero-shot, high-dimensional embedding spaces to interpret unlabeled data. ZCore then iteratively quantifies the relative value of all candidate data based on coverage and redundancy in numerous subspace distributions. Finally, ZCore selects a coreset sized for any data budget to train downstream models. We evaluate ZCore on four datasets and outperform several state-of-the-art label-based methods, especially at low data rates that provide the most substantial cost reduction. On ImageNet, ZCore selections for 10% training data achieve a downstream validation accuracy of 53.99%, which outperforms prior label-based methods and removes annotation and training costs for 1.15 million images. Our paper's code is publicly available at https://github.com/voxel51/zcore.
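To make the selection loop described above more concrete, the following is a minimal sketch, not the authors' implementation (which is in the linked repository). It assumes that embeddings have already been computed by some foundation model and are given as lists of floats. It also assumes one plausible reading of "coverage and redundancy": each iteration picks a random subset of embedding axes, samples a random location in that subspace, rewards the nearest candidate, and penalizes that candidate's closest neighbors. The function name, parameters, and the exact reward and penalty values are hypothetical choices made for illustration.

```python
import random


def sq_dist(a, b, dims):
    """Squared Euclidean distance restricted to the chosen dimensions."""
    return sum((a[d] - b[d]) ** 2 for d in dims)


def zero_shot_coreset_sketch(embeddings, budget, n_iters=200, subspace_dim=2,
                             n_neighbors=2, penalty=0.5, seed=0):
    """Score unlabeled points by coverage minus redundancy; return top indices.

    All parameter names and default values are illustrative assumptions.
    """
    rng = random.Random(seed)
    n, full_dim = len(embeddings), len(embeddings[0])
    scores = [0.0] * n
    for _ in range(n_iters):
        # 1. Sample a random low-dimensional subspace (a subset of axes).
        dims = rng.sample(range(full_dim), subspace_dim)
        # 2. Sample a random location inside the data's bounding box.
        target = [0.0] * full_dim
        for d in dims:
            lo = min(e[d] for e in embeddings)
            hi = max(e[d] for e in embeddings)
            target[d] = rng.uniform(lo, hi)
        # 3. Coverage: reward the candidate closest to the sampled location.
        best = min(range(n), key=lambda i: sq_dist(embeddings[i], target, dims))
        scores[best] += 1.0
        # 4. Redundancy: penalize the rewarded candidate's nearest neighbors.
        others = sorted((i for i in range(n) if i != best),
                        key=lambda i: sq_dist(embeddings[i], embeddings[best], dims))
        for i in others[:n_neighbors]:
            scores[i] -= penalty
    # Keep the highest-scoring candidates up to the requested data budget.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return ranked[:budget]


# Stand-in data: random vectors play the role of foundation-model embeddings.
data_rng = random.Random(1)
embeddings = [[data_rng.gauss(0, 1) for _ in range(8)] for _ in range(50)]
coreset = zero_shot_coreset_sketch(embeddings, budget=5)
```

Note that no labels and no model training appear anywhere in the loop, which is the property the abstract emphasizes. The budget only affects the final ranking step, so the same scores could be reused to produce coresets of several sizes.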
