Refined Coreset Selection: Towards Minimal Coreset Size under Model Performance Constraints

Xiaobo Xia; Jiale Liu; Shaokun Zhang; Qingyun Wu; Hongxin Wei; Tongliang Liu

TL;DR

The paper tackles refined coreset selection (RCS): minimizing coreset size under model-performance constraints, framed as a lexicographic bilevel optimization over a binary mask $\bm{m}$. It introduces Lexicographic Bilevel Coreset Selection (LBCS), with a primary performance objective $f_1(\bm{m})$ obtained from inner-loop training and a secondary size objective $f_2(\bm{m})=\|\bm{m}\|_0$, solved by a black-box outer-loop optimizer (LexiFlow) guided by lexicographic relations. The authors prove $\epsilon$-convergence under two sufficient conditions and show on Fashion-MNIST, SVHN, CIFAR-10, and ImageNet-1k that, compared with multiple baselines, LBCS yields better model performance with smaller coresets, or better performance at the same coreset size. The work highlights practical implications for data efficiency, privacy-preserving data sharing, and energy savings, and notes scalability considerations and potential applicability to broader vision tasks and large-scale pretraining.
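
As a reading aid, the two objectives named above can be written out roughly as follows. This is a sketch under assumed notation, not a quotation of the paper's equations: the $n$ training pairs $(x_i, y_i)$, the network $h(\cdot;\bm{\theta})$, the per-sample loss $\ell$, and the masked training loss $\mathcal{L}$ are assumptions introduced here for illustration; only $\bm{m}$, $f_1$, and $f_2(\bm{m})=\|\bm{m}\|_0$ come from the summary.

$$
\begin{aligned}
\bm{\theta}(\bm{m}) &\in \arg\min_{\bm{\theta}} \mathcal{L}(\bm{m},\bm{\theta}), \qquad \mathcal{L}(\bm{m},\bm{\theta}) = \frac{1}{\|\bm{m}\|_0}\sum_{i=1}^{n} m_i\,\ell\big(h(x_i;\bm{\theta}),\, y_i\big), \\
f_1(\bm{m}) &= \frac{1}{n}\sum_{i=1}^{n} \ell\big(h(x_i;\bm{\theta}(\bm{m})),\, y_i\big), \qquad f_2(\bm{m}) = \|\bm{m}\|_0 .
\end{aligned}
$$

Under this reading, the inner loop trains $\bm{\theta}$ on the points selected by $\bm{m}$, and the outer loop searches over $\bm{m}$ with $f_1$ taking priority over $f_2$: the size objective is reduced only among masks whose performance objective is already acceptable.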

Abstract

Coreset selection is powerful in reducing computational costs and accelerating data processing for deep learning algorithms. It strives to identify a small subset from large-scale data, so that training only on the subset practically performs on par with full data. Practitioners regularly desire to identify the smallest possible coreset in realistic scenes while maintaining comparable model performance, to minimize costs and maximize acceleration. Motivated by this desideratum, for the first time, we pose the problem of refined coreset selection, in which the minimal coreset size under model performance constraints is explored. Moreover, to address this problem, we propose an innovative method, which maintains optimization priority order over the model performance and coreset size, and efficiently optimizes them in the coreset selection procedure. Theoretically, we provide the convergence guarantee of the proposed method. Empirically, extensive experiments confirm its superiority compared with previous strategies, often yielding better model performance with smaller coreset sizes.

Paper Structure (38 sections, 1 theorem, 29 equations, 4 figures, 10 tables, 2 algorithms)

Introduction
Contributions
Related Literature
Preliminaries
RCS Solutions are Non-trivial
Methodology
Lexicographic Bilevel Coreset Selection
Optimization Algorithm
Theoretical Analysis
Experiments
Preliminary Presentation of Algorithm's Superiority
Comparison with the Competitors
Robustness against Imperfect Supervision
Evaluations on ImageNet-1k
More Justifications and Analyses
...and 23 more sections

Key Result

Theorem 2

Under the progressable condition and the lower-bound condition (sufficient conditions), the algorithm is $\epsilon$-convergent in the RCS problem. In the convergence statement, $\mathbbm{P}[f_{2}(\bm{m}^t) \leq f_2^*]$ represents the probability that the mask $\bm{m}^t$ generated at time $t$ is the converged solution as described above.

Figures (4)

Figure 1: Illustrations of phenomena of several trivial solutions discussed in §\ref{sec:2.1}. The experiment is based on \cite{zhou2022probabilistic}. The setup is provided in Appendix \ref{supp:exp_fig1}. Here, $k$ denotes the predefined coreset size before optimization. (a) $f_1(\bm{m})$ vs. outer iterations with (\ref{eq:cs_bi}); (b) $f_2(\bm{m})$ vs. outer iterations with (\ref{eq:cs_bi}); (c) $f_1(\bm{m})$ vs. outer iterations with (\ref{eq:cs_bi_mpo}); (d) $f_2(\bm{m})$ vs. outer iterations with (\ref{eq:cs_bi_mpo}).
Figure 2: Illustrations of coreset selection under imperfect supervision. (a) Test accuracy (%) in coreset selection with 30% corrupted labels; (b) Test accuracy (%) in coreset selection with class-imbalanced data. The optimized coreset sizes by LBCS in these cases are provided in Appendix \ref{supp:im_coreset_size}.
Figure 3: Illustration of the average accuracy (%) contributed per data point within the selected coreset.
Figure 4: Illustrations of coreset selection with 50% corrupted labels. The optimized coreset size by LBCS is provided in Appendix \ref{supp:im_coreset_size}.

Theorems & Definitions (3)

Definition 1: Lexicographic relations in RCS
Theorem 2: $\epsilon$-convergence
proof
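
To make the idea of a lexicographic relation over $(f_1, f_2)$ concrete, here is a minimal Python sketch. It is not the paper's Definition 1 and not LexiFlow: the additive tolerance `f1_best + eps`, the helper names `lex_better`, `toy_search`, and `toy_f1`, and the single-bit-flip random search are all assumptions chosen for illustration. The toy objective stands in for inner-loop training, which is omitted here.

```python
import random
from typing import Callable, Tuple

Mask = Tuple[int, ...]
Objs = Tuple[float, float]  # (f1, f2) = (performance loss, coreset size)


def lex_better(a: Objs, b: Objs, f1_best: float, eps: float) -> bool:
    """Return True if objective pair `a` is preferred to `b`.

    Assumed rule: f1 has priority. A pair is acceptable when its f1 lies
    within an additive tolerance `eps` of the best f1 seen so far. Among
    acceptable pairs, the smaller f2 wins (ties broken by f1).
    """
    tol = f1_best + eps
    a_ok, b_ok = a[0] <= tol, b[0] <= tol
    if a_ok and b_ok:
        return a[1] < b[1] or (a[1] == b[1] and a[0] < b[0])
    if a_ok != b_ok:
        return a_ok  # the acceptable pair is preferred
    return a[0] < b[0]  # neither acceptable: compare on f1 only


def toy_search(f1: Callable[[Mask], float], n: int, eps: float,
               steps: int, seed: int = 0) -> Tuple[Mask, Objs]:
    """Black-box search over binary masks by flipping one bit at a time."""
    rng = random.Random(seed)
    best: Mask = (1,) * n  # start from the full dataset
    best_obj: Objs = (f1(best), sum(best))
    f1_best = best_obj[0]
    for _ in range(steps):
        i = rng.randrange(n)
        cand = best[:i] + (1 - best[i],) + best[i + 1:]
        if sum(cand) == 0:  # skip the empty coreset
            continue
        obj = (f1(cand), sum(cand))
        f1_best = min(f1_best, obj[0])
        if lex_better(obj, best_obj, f1_best, eps):
            best, best_obj = cand, obj
    return best, best_obj


def toy_f1(mask: Mask) -> float:
    """Hypothetical stand-in for f1: counts how many of the first three
    ("informative") points are missing from the coreset."""
    return float(sum(1 - mask[i] for i in (0, 1, 2)))


mask, objs = toy_search(toy_f1, n=8, eps=0.0, steps=500)
print(mask, objs)
```

In this toy setting, dropping an uninformative point leaves `f1` unchanged and reduces `f2`, so it is accepted, while dropping an informative point violates the tolerance and is rejected. A larger `eps` would trade some performance for a smaller mask.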
