Large-scale Dataset Pruning with Dynamic Uncertainty

Muyang He, Shuo Yang, Tiejun Huang, Bo Zhao

TL;DR

The paper tackles the data and compute burden of training on large-scale datasets by pruning to an informative subset. It introduces Dynamic-Uncertainty (Dyn-Unc), which scores samples using prediction uncertainty computed over a sliding training window and aggregates across epochs to capture training dynamics. Empirically, Dyn-Unc achieves up to a 25% lossless pruning ratio on ImageNet-1K and ImageNet-21K with Swin Transformer, ConvNeXt, and ResNet, outperforming prior methods. The pruned data generalizes across architectures and improves out-of-distribution detection, highlighting practical gains in data efficiency for large-scale vision tasks.

Abstract

The state of the art in many learning tasks, e.g., image classification, is advanced by collecting larger datasets and then training larger models on them. As a result, the increasing computational cost is becoming unaffordable. In this paper, we investigate how to prune large-scale datasets and thus produce an informative subset for training sophisticated deep models with a negligible performance drop. We propose a simple yet effective dataset pruning method that exploits both prediction uncertainty and training dynamics. We study dataset pruning by measuring the variation of predictions during the whole training process on large-scale datasets, i.e., ImageNet-1K and ImageNet-21K, and advanced models, i.e., Swin Transformer and ConvNeXt. Extensive experimental results indicate that our method outperforms the state of the art and achieves a 25% lossless pruning ratio on both ImageNet-1K and ImageNet-21K. The code and pruned datasets are available at https://github.com/BAAI-DCAI/Dataset-Pruning.
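To make the scoring idea concrete, here is a minimal NumPy sketch of a sliding-window uncertainty score. It is an illustration under stated assumptions, not the authors' released implementation: it assumes the probability assigned to each sample's true label has been recorded at every epoch, uses the sample standard deviation (`ddof=1`) within each window, averages over all windows, and prunes the lowest-scoring samples first. The function names `dyn_unc_scores` and `keep_indices` and the parameter `window` are hypothetical.

```python
import numpy as np


def dyn_unc_scores(target_probs, window):
    """Score samples by windowed prediction uncertainty (illustrative sketch).

    target_probs: array of shape (num_epochs, num_samples) holding the
        probability the model assigned to each sample's true label at each
        epoch (an assumed logging format).
    window: number of consecutive epochs per sliding window (assumed >= 2).
    """
    probs = np.asarray(target_probs, dtype=np.float64)
    num_epochs = probs.shape[0]
    if not 2 <= window <= num_epochs:
        raise ValueError("window must be between 2 and the number of epochs")
    # Shape (num_epochs - window + 1, num_samples, window): one slice per window.
    windows = np.lib.stride_tricks.sliding_window_view(probs, window, axis=0)
    # Uncertainty inside each window: standard deviation over its epochs.
    per_window_std = windows.std(axis=-1, ddof=1)
    # Aggregate over training: average the per-window uncertainties.
    return per_window_std.mean(axis=0)


def keep_indices(scores, prune_ratio):
    """Return sorted indices of samples kept after pruning.

    Assumption: the samples with the lowest scores are pruned first.
    """
    scores = np.asarray(scores)
    num_keep = len(scores) - int(round(prune_ratio * len(scores)))
    order = np.argsort(-scores, kind="stable")  # highest score first
    return np.sort(order[:num_keep])


# Toy trajectories: 4 epochs, 3 samples.
toy_probs = np.array([
    [0.9, 0.2, 0.5],
    [0.9, 0.8, 0.6],
    [0.9, 0.2, 0.5],
    [0.9, 0.8, 0.6],
])
toy_scores = dyn_unc_scores(toy_probs, window=2)
toy_kept = keep_indices(toy_scores, prune_ratio=1 / 3)
```

In this toy run, the first sample is predicted with constant confidence, so its score is zero and it is the one pruned; the strongly oscillating second sample scores highest. The exact number of windows and the normalization may differ from the paper's definition.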

Paper Structure (24 sections, 3 equations, 14 figures, 7 tables, 1 algorithm)

Figures (14)

  • Figure 1: Top-1 accuracy on ImageNet-1K-val
  • Figure 2: Top-1 accuracy on ImageNet-1K-ReaL
  • Figure 4: ImageNet-1K Train Set
  • Figure 5: Dyn-Unc (Ours) Pruned
  • Figure 6: Forgetting Pruned
  • ...and 9 more figures