Coverage-centric Coreset Selection for High Pruning Rates

Haizhong Zheng; Rui Liu; Fan Lai; Atul Prakash

TL;DR

The paper addresses how to select small training subsets that preserve accuracy at high pruning rates. It develops a theoretical coverage framework by extending geometric set cover to a density-based distribution cover, and introduces a metric, AUC$_{pr}$, that quantifies how well a coreset covers the data distribution. Building on this, it proposes Coverage-centric Coreset Selection (CCS), a stratified method that allocates sampling budgets across importance-score strata to improve data coverage, especially at high pruning rates. Across five datasets, CCS outperforms state-of-the-art methods and random sampling at high pruning rates while remaining competitive at lower pruning rates, which makes it a strong baseline for one-shot coreset selection.

Abstract

One-shot coreset selection aims to select a representative subset of the training data, given a pruning rate, that can later be used to train future models while retaining high accuracy. State-of-the-art coreset selection methods pick the highest importance examples based on an importance metric and are found to perform well at low pruning rates. However, at high pruning rates, they suffer from a catastrophic accuracy drop, performing worse than even random sampling. This paper explores the reasons behind this accuracy drop both theoretically and empirically. We first propose a novel metric to measure the coverage of a dataset on a specific distribution by extending the classical geometric set cover problem to a distribution cover problem. This metric helps explain why coresets selected by SOTA methods at high pruning rates perform poorly compared to random sampling because of worse data coverage. We then propose a novel one-shot coreset selection method, Coverage-centric Coreset Selection (CCS), that jointly considers overall data coverage upon a distribution as well as the importance of each example. We evaluate CCS on five datasets and show that, at high pruning rates (e.g., 90%), it achieves significantly better accuracy than previous SOTA methods (e.g., at least 19.56% higher on CIFAR10) as well as random selection (e.g., 7.04% higher on CIFAR10) and comparable accuracy at low pruning rates. We make our code publicly available at https://github.com/haizhongzheng/Coverage-centric-coreset-selection.
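The idea of balancing coverage against importance can be illustrated with a small sketch of stratified sampling over importance scores. This is not the authors' implementation (see the linked repository for that). The function name `stratified_coreset`, the parameters `num_strata` and `hard_cutoff`, the equal-width strata, the convention that lower scores mean harder examples, and the rule of splitting the remaining budget evenly while visiting the smallest stratum first are all assumptions made for illustration.

```python
import numpy as np

def stratified_coreset(scores, budget, num_strata=50, hard_cutoff=0.0, seed=0):
    """Illustrative coverage-aware selection; returns indices of chosen examples.

    Assumptions (hypothetical, not taken from the source): a lower score
    means a harder example, strata have equal width in score space, and the
    remaining budget is split evenly over the strata not yet processed.
    """
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)

    # Optionally drop the hardest fraction of examples before sampling.
    order = np.argsort(scores)
    n_drop = int(hard_cutoff * len(scores))
    candidates = order[n_drop:]
    budget = min(budget, len(candidates))

    # Assign every remaining example to an equal-width stratum.
    kept = scores[candidates]
    edges = np.linspace(kept.min(), kept.max(), num_strata + 1)
    bin_ids = np.digitize(kept, edges[1:-1])  # values in 0 .. num_strata - 1
    strata = [candidates[bin_ids == b] for b in range(num_strata)]
    # Visit small strata first so their unused budget flows to larger ones.
    strata = sorted((s for s in strata if len(s) > 0), key=len)

    selected = []
    remaining = budget
    for i, stratum in enumerate(strata):
        share = remaining // (len(strata) - i)  # even split of what is left
        take = min(len(stratum), share)
        selected.append(rng.choice(stratum, size=take, replace=False))
        remaining -= take
    return np.concatenate(selected) if selected else np.array([], dtype=int)
```

A call such as `stratified_coreset(scores, budget=5000, hard_cutoff=0.1)` would return 5000 indices spread across the score range, in contrast to taking only the top-ranked examples. Sparse strata are kept almost entirely while dense strata are subsampled, which is one simple way to keep the high-density "easy" region represented.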

Paper Structure (24 sections, 3 theorems, 17 equations, 7 figures, 10 tables, 1 algorithm)

This paper contains 24 sections, 3 theorems, 17 equations, 7 figures, 10 tables, 1 algorithm.

Introduction
Preliminaries
One-shot Coreset selection
Catastrophic accuracy drop with high pruning rates
The coverage of coresets
Density-based partial coverage
Coverage analysis on coresets
Methodology: Coverage-centric coreset selection
Evaluation
Coreset performance comparison
Ablation study & analysis
Related work
Conclusion
Overview
Detailed Experimental setting
...and 9 more sections

Key Result

Theorem 1

Given $n$ i.i.d. samples drawn from $P_\mu$ as $\mathcal{S} = \{{\bm{x}}_i, y_i\}_{i\in[n]}$ where $y_i \in [C]$ is the class label for example ${\bm{x}}_i$, a coreset $\mathcal{S'}$ which is a $p$-partial $r$-cover for $P_\mu$ on the input space $X$, and an $\epsilon > 1-p$, if the loss function $l(\cdot,

Figures (7)

Figure 1: Existing coreset solutions have better accuracy than random sampling at low pruning rates, but perform worse at high pruning rates.
Figure 2: The $p$-$r$ curves of different subsets of the CIFAR10 training dataset. The dataset without any pruning has the lowest curve (blue). The forgetting curve (green) and the random curve (orange) cross over around $p=80\%$.
Figure 3: The relationship between AUC$_{pr}$ and accuracy. Different colors stand for different coreset selection methods. A larger circle indicates a higher pruning rate on the dataset. A smaller AUC$_{pr}$ often comes with better accuracy. At high pruning rates (larger circles), forgetting and AUM have a higher AUC$_{pr}$ and worse accuracy.
Figure 4: CIFAR10 data distribution with AUM as the importance metric (lower AUM values are more important). At a 70% pruning rate, the SOTA method selects data from the red region, since it prunes low-importance (easy) examples first. Our method, CCS, selects data from the green region using stratified sampling across the importance metric, along with pruning the hardest examples on the left. The coreset selected by the SOTA method lacks "easy" examples on the right, in the high-density region.
Figure 5: Performance comparison between our proposed method (CCS) and other baselines on CIFAR10, CIFAR100, and ImageNet-1k. The pruning rate is the fraction of examples removed from the original dataset. The evaluation results show that our method achieves better performance than all other baselines at high pruning rates (e.g., $70\%$, $80\%$, $90\%$) and comparable performance at low pruning rates (e.g., $30\%$, $50\%$). We also present detailed numerical results in Appendix \ref{sec:app-eval}.
...and 2 more figures
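
The $p$-$r$ curves and AUC$_{pr}$ mentioned in the captions above can be approximated empirically. The sketch below is one plausible reading, not the paper's exact procedure: it assumes that for each $p$, the radius $r$ is the $p$-quantile of the distance from each data point to its nearest coreset point, and that AUC$_{pr}$ is the trapezoidal area under that curve. The helper names, the Euclidean metric, and the use of raw feature vectors are assumptions.

```python
import numpy as np

def nearest_coreset_distance(data, coreset):
    """Euclidean distance from every data point to its closest coreset point."""
    # Pairwise differences, shape (len(data), len(coreset), dim).
    diffs = data[:, None, :] - coreset[None, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1)).min(axis=1)

def p_r_curve(data, coreset, ps):
    """For each p in ps, the radius within which a fraction p of the data
    has at least one coreset point (an empirical quantile, by assumption)."""
    distances = nearest_coreset_distance(data, coreset)
    return np.quantile(distances, ps)

def auc_pr(data, coreset, num_points=101):
    """Trapezoidal area under the empirical p-r curve for p in [0, 1]."""
    ps = np.linspace(0.0, 1.0, num_points)
    rs = p_r_curve(data, coreset, ps)
    return float(np.sum((rs[1:] + rs[:-1]) / 2.0 * np.diff(ps)))
```

Under this reading, the unpruned dataset covers itself at radius zero, which matches the caption's remark that it has the lowest curve, and a coreset that misses dense regions yields larger radii and therefore a larger area.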

Theorems & Definitions (6)

Definition 1: $p$-partial $r$-cover
Theorem 1
Proposition 1
Lemma 1
proof
proof
