Dataset Pruning: Reducing Training Data by Examining Generalization Influence

Shuo Yang, Zeke Xie, Hanyu Peng, Min Xu, Mingming Sun, Ping Li

TL;DR

This work tackles data redundancy in deep learning by introducing dataset pruning, an optimization-based subset selection method that bounds the generalization gap using influence-function estimates of parameter change. It formulates a discrete optimization problem that identifies the largest $\epsilon$-redundant subset, i.e. the largest set of training examples whose removal induces a bounded generalization gap. The approach is theoretically grounded, with a bound on the generalization gap whose terms scale with $\epsilon/n$ and $m/n^{2}$, where $n$ is the size of the original dataset and $m$ is the number of pruned examples. Empirically, the method prunes 40% of the CIFAR-10 training examples and nearly halves the convergence time with only a 1.3% drop in test accuracy. A dataset pruned with a small network also transfers to larger architectures, and it can be used to speed up hyper-parameter tuning and neural architecture search (NAS).

Abstract

The great success of deep learning relies heavily on increasingly large training data, which comes at the price of huge computational and infrastructural costs. This raises crucial questions: do all training data contribute to the model's performance? How much does each individual training sample, or a subset of the training set, affect the model's generalization? And how can we construct the smallest subset of the entire training data to serve as a proxy training set without significantly sacrificing the model's performance? To answer these questions, we propose dataset pruning, an optimization-based sample selection method that can (1) examine the influence of removing a particular set of training samples on the model's generalization ability with a theoretical guarantee, and (2) construct the smallest subset of training data that yields a strictly constrained generalization gap. The empirically observed generalization gap of dataset pruning is substantially consistent with our theoretical expectations. Furthermore, the proposed method prunes 40% of the training examples on the CIFAR-10 dataset and halves the convergence time with only a 1.3% decrease in test accuracy, which is superior to previous score-based sample selection methods.

Paper Structure

This paper contains 14 sections, 1 theorem, 10 equations, 4 figures, 1 table, 1 algorithm.

Key Result

Theorem 1

Suppose that the original dataset is $\mathcal{D}$ and the pruned dataset is $\hat{\mathcal{D}} = \{ \hat{z}_{i} \}_{i=1}^{m}$. If $\left\| \sum_{\hat{z}_{i} \in \hat{\mathcal{D}}} \mathcal{I}_{\mathrm{param}}(\hat{z}_{i}) \right\|_2 \leq \epsilon$, we have the upper bound of the generalization gap …
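
The condition in Theorem 1 constrains the norm of the summed parameter influence of the pruned examples. As a rough illustration, the sketch below computes that quantity on a toy ridge-regression problem and compares the influence-based estimate of the parameter change with the change obtained by actually refitting. It assumes the standard influence-function definition $\mathcal{I}_{\mathrm{param}}(z) = -H^{-1} \nabla_{\theta} L(z, \hat{\theta})$, which this summary does not spell out; the model, the data, the value of `epsilon`, and all function and variable names are hypothetical choices made for the example, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 1000, 5, 1e-2  # toy sizes and ridge strength (assumed values)

# Synthetic regression data standing in for a real training set.
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

# Full-data fit: minimize (1/n) * sum_i 0.5*(x_i^T theta - y_i)^2 + 0.5*lam*||theta||^2.
H = X.T @ X / n + lam * np.eye(d)          # Hessian of the objective
theta = np.linalg.solve(H, X.T @ y / n)

def param_influence(X, y, theta, H):
    """Row i is -H^{-1} g_i, with g_i the gradient of sample i's loss at theta."""
    residual = X @ theta - y
    grads = X * residual[:, None]          # shape (n, d)
    return -np.linalg.solve(H, grads.T).T

infl = param_influence(X, y, theta, H)

# Candidate subset to prune (here simply the first 10 samples).
pruned = np.arange(10)
keep = np.setdiff1d(np.arange(n), pruned)

# Quantity constrained in Theorem 1, compared against an arbitrary epsilon.
group_norm = np.linalg.norm(infl[pruned].sum(axis=0))
epsilon = 1.0                              # assumed threshold, for illustration only
satisfies_condition = bool(group_norm <= epsilon)

# First-order estimate of the parameter change: removing a sample corresponds
# to up-weighting it by -1/n, so the change is -(1/n) times its influence.
predicted_change = -infl[pruned].sum(axis=0) / n

# Ground truth: refit without the pruned samples (same 1/n normalization).
H_keep = X[keep].T @ X[keep] / n + lam * np.eye(d)
theta_keep = np.linalg.solve(H_keep, X[keep].T @ y[keep] / n)
actual_change = theta_keep - theta

rel_err = np.linalg.norm(predicted_change - actual_change) / np.linalg.norm(actual_change)
print(f"group influence norm = {group_norm:.4f}, relative error = {rel_err:.4f}")
```

For a deep network the Hessian is not available in closed form, so the inverse-Hessian-vector products would have to be approximated; the closed-form solve here is only possible because the toy model is quadratic.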

Figures (4)

  • Figure 1: We compare our proposed optimization-based dataset pruning method with several sample-selection baselines. Our optimization-based pruning method considers the 'group effect' of pruned examples and exhibits superior performance, especially when the pruning ratio is high.
  • Figure 2: Comparison of the empirically observed generalization gap with the theoretical expectation from Theorem 1. We ignore the $m / n^{2}$ term since it is much smaller in magnitude than $\epsilon / n$.
  • Figure 3: To evaluate how well the pruned dataset generalizes to unseen architectures, we prune the CIFAR-10 dataset using a relatively small network and then train a larger network on the pruned dataset. We consider three networks from two families with different parameter counts: SENet (1.25M parameters), ResNet18 (11.69M parameters), and ResNet50 (25.56M parameters). The results indicate that a dataset pruned by small networks generalizes well to large networks.
  • Figure 4: Dataset pruning significantly improves training efficiency with only a minor sacrifice in performance. When pruning 40% of the training examples, the convergence time is nearly halved with only a 1.3% drop in test accuracy. The pruned dataset can also be used to tune hyper-parameters and network architectures, reducing search time.
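
As a scale check on the Figure 2 remark that the $m / n^{2}$ term is much smaller than $\epsilon / n$, the two terms can be compared directly. The worked example below assumes the standard CIFAR-10 training set size $n = 50{,}000$ and the 40% pruning ratio quoted above, so $m = 20{,}000$; the values of $\epsilon$ used in the paper are not given in this summary, so $\epsilon$ is left symbolic.

```latex
\[
\frac{m/n^{2}}{\epsilon/n} = \frac{m}{n\,\epsilon},
\qquad
n = 50{,}000,\quad m = 0.4\,n = 20{,}000
\;\Longrightarrow\;
\frac{m}{n^{2}} = \frac{20{,}000}{(50{,}000)^{2}} = 8 \times 10^{-6},
\qquad
\frac{m}{n\,\epsilon} = \frac{0.4}{\epsilon}.
\]
```

Under these assumptions, the $m / n^{2}$ term is negligible relative to $\epsilon / n$ whenever $\epsilon$ is large compared with the pruning ratio $m / n = 0.4$.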

Theorems & Definitions (3)

  • Definition 1: $\epsilon$-redundant subset.
  • Theorem 1: Generalization Gap of Dataset Pruning
  • proof
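
To make the notion of an $\epsilon$-redundant subset more concrete, the sketch below grows a candidate subset while keeping the norm of its summed influence vectors within $\epsilon$, which is the condition used in Theorem 1. It uses a simple greedy heuristic as a stand-in and should not be read as the paper's solver for its discrete optimization problem; the influence vectors are random placeholders, and `greedy_redundant_subset`, `influences`, and the value of `epsilon` are hypothetical names and values introduced for this example.

```python
import numpy as np

def greedy_redundant_subset(influences, epsilon):
    """Greedily pick samples whose summed influence stays within epsilon.

    influences: array of shape (n, p), one parameter-influence vector per sample.
    Returns the list of selected (to-be-pruned) sample indices.
    """
    n, p = influences.shape
    remaining = np.ones(n, dtype=bool)
    total = np.zeros(p)            # running sum of selected influence vectors
    chosen = []
    while remaining.any():
        idx = np.flatnonzero(remaining)
        # Norm of the summed influence if each remaining sample were added next.
        norms = np.linalg.norm(total + influences[idx], axis=1)
        best = int(np.argmin(norms))
        if norms[best] > epsilon:  # no remaining sample keeps the constraint
            break
        total = total + influences[idx[best]]
        remaining[idx[best]] = False
        chosen.append(int(idx[best]))
    return chosen

rng = np.random.default_rng(0)
influences = rng.normal(size=(300, 8))   # synthetic stand-in for real influences
epsilon = 2.0                            # assumed threshold

pruned = greedy_redundant_subset(influences, epsilon)
pruned_idx = np.asarray(pruned, dtype=int)
group_norm = float(np.linalg.norm(influences[pruned_idx].sum(axis=0)))
print(f"pruned {len(pruned)} of {len(influences)} samples, group norm = {group_norm:.3f}")
```

Because the selection looks at the sum of influence vectors rather than at each sample's individual score, samples whose influences cancel each other can be pruned together, which is the 'group effect' mentioned in the Figure 1 caption. A greedy pass like this one gives no guarantee of finding the largest feasible subset.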