BWS: Best Window Selection Based on Sample Scores for Data Pruning across Broad Ranges

Hoyong Choi, Nohyun Ki, Hye Won Chung

TL;DR

This work tackles data subset selection for neural network training across a broad spectrum of selection ratios, a regime where prior methods struggle to remain competitive. It proposes Best Window Selection (BWS), which orders samples by a difficulty-based score and searches contiguous window subsets, scoring each window with a kernel ridge regression proxy trained on model-derived features. Across CIFAR-10/100 and ImageNet, BWS consistently outperforms both score-based and optimization-based baselines over ratios from 1% to 90%, approaches the Oracle window, and remains robust to label noise and cross-architecture variations. The method is computationally efficient, leveraging a simple proxy task and avoiding expensive full-model evaluations, which makes it appealing for practical data pruning in large-scale settings.

Abstract

Data subset selection aims to find a smaller yet informative subset of a large dataset that can approximate the full-dataset training, addressing challenges associated with training neural networks on large-scale datasets. However, existing methods tend to specialize in either high or low selection ratio regimes, lacking a universal approach that consistently achieves competitive performance across a broad range of selection ratios. We introduce a universal and efficient data subset selection method, Best Window Selection (BWS), by proposing a method to choose the best window subset from samples ordered based on their difficulty scores. This approach offers flexibility by allowing the choice of window intervals that span from easy to difficult samples. Furthermore, we provide an efficient mechanism for selecting the best window subset by evaluating its quality using kernel ridge regression. Our experimental results demonstrate the superior performance of BWS compared to other baselines across a broad range of selection ratios over datasets, including CIFAR-10/100 and ImageNet, and the scenarios involving training from random initialization or fine-tuning of pre-trained models.
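The window search described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `best_window` and the parameters `features`, `scores`, `step`, and `reg` are hypothetical, a linear kernel is assumed for the kernel ridge regression, and the convention that a higher score means a harder sample is also an assumption. Each window is scored by the accuracy of its ridge classifier on the full training set, following the Figure 1 description.

```python
import numpy as np

def best_window(features, labels, scores, ratio, step=0.05, reg=1.0):
    """Return (start, accuracy) of the best contiguous window.

    features: (n, d) array of model-derived features (assumed given)
    labels:   (n,) integer class labels
    scores:   (n,) difficulty scores, higher = harder (assumption)
    ratio:    fraction of the dataset to select
    step:     spacing between window starting points, as a fraction of n
    reg:      ridge regularization strength (hypothetical default)
    """
    n = len(labels)
    num_classes = int(labels.max()) + 1
    order = np.argsort(-scores)               # hardest samples first
    m = max(1, int(round(ratio * n)))         # window size
    stride = max(1, int(round(step * n)))     # gap between starting points
    onehot = np.eye(num_classes)[labels]      # regression targets

    best_start, best_acc = None, -1.0
    for start in range(0, n - m + 1, stride):
        idx = order[start:start + m]
        X, Y = features[idx], onehot[idx]
        # Kernel ridge regression in dual form with a linear kernel:
        # alpha = (K + reg * I)^{-1} Y, and the linear classifier is X^T alpha.
        K = X @ X.T
        alpha = np.linalg.solve(K + reg * np.eye(m), Y)
        W = X.T @ alpha
        # Score the window by classifier accuracy on the full training set.
        acc = float(np.mean((features @ W).argmax(axis=1) == labels))
        if acc > best_acc:
            best_start, best_acc = start, acc
    return best_start, best_acc
```

The samples `order[start:start + m]` of the returned window would then be used to train the actual network. Solving the dual system costs roughly cubic time in the window size, so for large windows the equivalent primal form (a d-by-d solve) may be preferable.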

Paper Structure (46 sections, 7 theorems, 18 equations, 12 figures, 25 tables, 1 algorithm)

Key Result

Theorem 1

If the subset size is as small as $|{\mathbf{S}}|=m\ll \sqrt{d/\ln d}$, then the first coordinate of ${\mathbf{w}}_{{\mathbf{S}}}$ is approximated as $({\mathbf{w}}_{{\mathbf{S}}})_1 \approx \sum_{i=1}^m|({\mathbf{x}}_i)_1|$. On the other hand, if $|{\mathbf{S}}|=m\gg d^2\ln{d}$, it can be approxim

Figures (12)

  • Figure 1: Overview of the proposed method, Best Window Selection (BWS). BWS is composed of two parts: 1) generating window subsets and 2) evaluating window subsets. We first sort samples by a difficulty score (e.g., the Forgetting score) and generate window subsets of a fixed size while varying their starting points. We then evaluate the window subsets by solving kernel ridge regression on the input features of each window subset, obtaining a simple (linear) classifier for each window subset. Finally, we evaluate the performance of these classifiers on the full training dataset to identify the best window subset, i.e., the one achieving the highest accuracy.
  • Figure 2: Results of the "training set split" experiment on the CIFAR-10 dataset, in which five different models are trained on five different data subsets, divided by difficulty ranking from $[0,20]\%$ (hardest) to $[80,100]\%$ (easiest). Model accuracies ($y$-axis) are evaluated on all five subsets ($x$-axis) separately. The right figures visualize the t-SNE of test samples' features extracted from models trained on the hardest $[0,20]\%$ subset (top) and the $[20,40]\%$ subset (bottom).
  • Figure 3: Sliding window experiments measuring the test accuracy of models trained on window subsets while varying the starting point of the windows, on the CIFAR-10 (left) and CIFAR-100 (right) datasets. Samples are sorted in descending order of difficulty score. The horizontal lines show results from random selection. For each subset ratio there exists a best window, and its starting point shifts toward the left as the subset ratio increases. Results for the ImageNet dataset are reported in Appendix \ref{sec:sliding_window_app}.
  • Figure 4: (a, b, c) Data pruning experiments. Test accuracy of models trained on data subsets of varying ratios from the CIFAR-10, CIFAR-100, and ImageNet datasets, selected by different methods. Our method (BWS) outperforms other baselines across a wide range of selection ratios and achieves accuracy as high as that of the Oracle window. Full results are reported in Tables \ref{tab:CIFAR10_acc}--\ref{tab:ImageNet_acc}.
  • Figure 5: (a) Cross-architecture experiment. Test accuracy of the model fine-tuned on subsets of varying ratios of the CIFAR-10 dataset, selected by different methods. We use the Vision Transformer (ViT) architecture, pretrained on the ImageNet dataset. (b) Robustness to label noise. Data pruning experiments on CIFAR-10 with 20% label noise. In both experiments, BWS surpasses other baselines over a wide range of selection ratios.
  • ...and 7 more figures

Theorems & Definitions (9)

  • Theorem 1: Informal
  • Lemma 1.1: Chi-square tail bound
  • Lemma 1.2: Gaussian tail bound
  • Lemma 1.3: Gershgorin circle theorem
  • Theorem 2: Sample-deficient regime
  • proof
  • Theorem 3: Sample-sufficient regime
  • proof
  • Theorem 4: Informal version of Equivalence_NN_KRR