ELFS: Label-Free Coreset Selection with Proxy Training Dynamics

Haizhong Zheng, Elisa Tsai, Yifu Lu, Jiachen Sun, Brian R. Bartoldson, Bhavya Kailkhura, Atul Prakash

TL;DR

ELFS tackles the high labeling cost in deep learning by introducing label-free coreset selection that leverages pseudo-labels from deep clustering to approximate training-dynamics-based data difficulty. A double-end pruning mechanism mitigates distribution shifts caused by pseudo-label noise, enabling effective subset selection without ground-truth labels. Across CIFAR-10/100, STL-10, and ImageNet-1K with various encoders (SwAV, DINO, CLIP), ELFS consistently outperforms existing label-free baselines and, in some cases, approaches supervised coreset performance. The approach offers practical data-efficiency gains, remains robust to noisy pseudo-labels, and transfers across models and datasets. The code is publicly available on GitHub.

Abstract

High-quality human-annotated data is crucial for modern deep learning pipelines, yet the human annotation process is both costly and time-consuming. Given a constrained human labeling budget, selecting an informative and representative data subset for labeling can significantly reduce human annotation effort. Well-performing state-of-the-art (SOTA) coreset selection methods require ground truth labels over the whole dataset, failing to reduce the human labeling burden. Meanwhile, SOTA label-free coreset selection methods deliver inferior performance due to poor geometry-based difficulty scores. In this paper, we introduce ELFS (Effective Label-Free Coreset Selection), a novel label-free coreset selection method. ELFS significantly improves label-free coreset selection by addressing two challenges: 1) ELFS utilizes deep clustering to estimate training dynamics-based data difficulty scores without ground truth labels; 2) Pseudo-labels introduce a distribution shift in the data difficulty scores, and we propose a simple but effective double-end pruning method to mitigate bias on calculated scores. We evaluate ELFS on four vision benchmarks and show that, given the same vision encoder, ELFS consistently outperforms SOTA label-free baselines. For instance, when using SwAV as the encoder, ELFS outperforms D2 by up to 10.2% in accuracy on ImageNet-1K. We make our code publicly available on GitHub.
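
As an illustration of a training-dynamics difficulty score computed from pseudo-labels, the sketch below uses AUM (area under the margin), the score that the Figure 4 caption refers to. It is not the authors' implementation. It assumes that logits were recorded at every epoch while training a model on the pseudo-labeled data, and the names `aum_scores`, `logits_per_epoch`, and `pseudo_labels` are hypothetical.

```python
import numpy as np

def aum_scores(logits_per_epoch, pseudo_labels):
    """Average margin per sample across epochs (hypothetical helper).

    logits_per_epoch: array of shape (epochs, samples, classes).
    pseudo_labels:    integer array of shape (samples,), e.g. cluster ids.
    A low score marks a sample that the model finds hard to fit.
    """
    logits = np.asarray(logits_per_epoch, dtype=float)
    labels = np.asarray(pseudo_labels)
    idx = np.arange(logits.shape[1])

    # Logit of the assigned (pseudo-)label at every epoch: shape (epochs, samples).
    assigned = logits[:, idx, labels]

    # Largest logit among all other classes at every epoch.
    masked = logits.copy()
    masked[:, idx, labels] = -np.inf
    largest_other = masked.max(axis=2)

    # Margin per epoch, averaged over epochs.
    return (assigned - largest_other).mean(axis=0)
```

Because the labels here are pseudo-labels rather than ground truth, the resulting scores are only a proxy for the supervised scores, which is the distribution shift that the abstract describes.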

Paper Structure (39 sections, 3 equations, 10 figures, 12 tables)

This paper contains 39 sections, 3 equations, 10 figures, 12 tables.

Figures (10)

  • Figure 1: Label-free Coreset Selection Scheme. The goal of label-free coreset selection is to identify an informative and representative subset of the data without relying on ground truth labels, minimizing human annotation efforts in deep learning pipelines.
  • Figure 2: ELFS pipeline: (1) Training dynamics score estimation: ELFS begins by calculating image embeddings and nearest neighbors using a vision encoder and then assigns pseudo-labels to unlabeled data via deep clustering algorithms. The pseudo-labeled dataset is then used to compute training dynamics scores. (2) Coreset selection with proxy difficulty scores: With the pseudo-label-based scores, ELFS performs double-end pruning to select the unlabeled coreset. Subsequently, humans annotate the selected coreset. This labeled coreset is used for later training.
  • Figure 3: Performance comparison of supervised CCS (Zheng et al., 2022), label-free CCS, the best label-free baseline (label-free D2; Maharana et al., 2023), and our method ELFS on CIFAR100.
  • Figure 4: Ground truth label AUM distribution of different coresets on CIFAR100. The pruning rate for each coreset is 50%. After applying double-end (ELFS) pruning to data with pseudo-label-based scores, ELFS covers more hard data in the selected coreset when mapped onto ground truth AUM.
  • Figure 5: (a) reports the pseudo-label validation accuracy for different values of $\beta$ on CIFAR100. (b) reports the ground truth (oracle) label test accuracy for different values of $\beta$. Curves with different colors stand for different pruning rates. $\diamond$ indicates the $\beta$ selected by the best pseudo-label validation accuracy, which is equal to or close to the optimal $\beta$.
  • ...and 5 more figures
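
The double-end pruning named in the captions of Figures 2, 4, and 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: it assumes that $\beta$ is the fraction of the hardest examples removed first, that the remaining budget is then met by removing the easiest examples, and that a higher score means an easier example (as with AUM). The function name `double_end_prune` and its arguments are hypothetical.

```python
import numpy as np

def double_end_prune(scores, budget, beta):
    """Select `budget` sample indices by pruning both ends of the score range.

    scores: difficulty proxy per sample, higher = easier (assumption).
    budget: number of samples to keep for human annotation.
    beta:   fraction of the hardest samples pruned first (assumed meaning).
    """
    scores = np.asarray(scores, dtype=float)
    n = scores.shape[0]
    n_hard = int(beta * n)
    if budget > n - n_hard:
        raise ValueError("budget exceeds what remains after hard pruning")

    # Sort from hardest (lowest score) to easiest (highest score).
    order = np.argsort(scores, kind="stable")

    # Drop the n_hard hardest samples, then keep the next `budget` samples;
    # everything easier than those is pruned from the other end.
    kept = order[n_hard:n_hard + budget]
    return np.sort(kept)

# Usage: keep 5 of 10 samples after pruning the hardest 20%.
example_scores = [0.9, -0.5, 0.1, 0.4, 0.8, -0.1, 0.3, 0.6, 0.2, 0.7]
selected = double_end_prune(example_scores, budget=5, beta=0.2)
```

Following the Figure 5 caption, $\beta$ would be chosen by running the selection for several candidate values and keeping the one with the best pseudo-label validation accuracy, which needs no ground truth labels.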