Exploring Learning Complexity for Efficient Downstream Dataset Pruning

Wenyu Jiang, Zhenlong Liu, Zejian Xie, Songxin Zhang, Bingyi Jing, Hongxin Wei

TL;DR

This work tackles the high cost of fine-tuning large pre-trained models by proposing a training-free approach to prune downstream data. It introduces Distorting-based Learning Complexity (DLC), which estimates sample hardness by masking weights to create a learning path and using Monte Carlo estimation, and complements it with FlexRand, a randomized under-sampling strategy that adapts to data regimes and mitigates distribution shift. The method demonstrates state-of-the-art performance and dramatic pruning-time reductions (notably about 35x faster in vision benchmarks) across diverse image and instruction datasets, including LLM fine-tuning tasks. The results suggest a practical path toward efficient downstream adaptation that leverages pre-trained representations without backpropagation, while highlighting robustness to pretraining quality and noting limitations when representations are weakly learned.

Abstract

The ever-increasing cost of fine-tuning large-scale pre-trained models makes dataset pruning, which aims to reduce dataset size while maintaining task performance, increasingly important. However, existing dataset pruning methods require training on the entire dataset, which is impractical for large-scale pre-trained models. In this paper, we propose a straightforward, novel, and training-free hardness score named Distorting-based Learning Complexity (DLC) to efficiently identify informative images and instructions in the downstream dataset. Our method is motivated by the observation that easy samples, which are learned faster, can also be learned with fewer parameters. Specifically, we define the Learning Complexity to quantify sample hardness and use a lightweight weight-masking process for fast estimation instead of costly SGD optimization. Based on DLC, we further design a flexible under-sampling strategy with randomness (dubbed FlexRand) that replaces the top-K strategy to alleviate severe subset distribution shift. Extensive experiments on downstream image and instruction dataset pruning benchmarks demonstrate the effectiveness and efficiency of the proposed approach. On the image pruning benchmark, DLC reduces the pruning time by 35x while establishing state-of-the-art performance with FlexRand.
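The masking idea in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the single linear layer standing in for a pre-trained model, the cosine-similarity prototype classifier, and the names `masked_features`, `prototype_loss`, and `masking_hardness` are all assumptions made for illustration. The sketch randomly masks weights at several ratios, computes each sample's loss against class-mean prototypes, and averages the losses over the ratios as a Monte Carlo style hardness estimate.

```python
import numpy as np

def masked_features(x, weight, mask_ratio, rng):
    """Project inputs through a randomly masked copy of `weight` (hypothetical stand-in model)."""
    keep = rng.random(weight.shape) >= mask_ratio  # True where a weight survives masking
    return x @ (weight * keep)

def prototype_loss(feats, labels, num_classes):
    """Per-sample cross-entropy against class-mean prototypes (assumed classifier)."""
    protos = np.stack([feats[labels == c].mean(axis=0) for c in range(num_classes)])
    # Cosine similarities serve as logits; the epsilon guards all-zero feature rows.
    f = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-12)
    p = protos / (np.linalg.norm(protos, axis=1, keepdims=True) + 1e-12)
    logits = f @ p.T
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels]

def masking_hardness(x, labels, weight, mask_ratios, num_classes, seed=0):
    """Average per-sample loss over masking ratios; higher means harder in this sketch."""
    rng = np.random.default_rng(seed)
    losses = [
        prototype_loss(masked_features(x, weight, r, rng), labels, num_classes)
        for r in mask_ratios
    ]
    return np.mean(losses, axis=0)

# Toy data: 40 samples, 4 balanced classes, a random 16x8 "pre-trained" weight matrix.
data_rng = np.random.default_rng(0)
x = data_rng.normal(size=(40, 16))
labels = np.arange(40) % 4
weight = data_rng.normal(size=(16, 8))
scores = masking_hardness(x, labels, weight, np.linspace(0.02, 0.98, 9), num_classes=4)
```

No gradient step appears anywhere in the sketch, which is the property the abstract calls training-free; a real pipeline would replace the toy linear layer with the pre-trained network.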

Paper Structure (56 sections, 12 equations, 7 figures, 15 tables)

Figures (7)

  • Figure 1: Performance comparison of different dataset pruning methods. (a): Time for downstream dataset pruning. Existing training-based methods are expensive, whereas our method achieves a 35$\times$ speedup. (b): Accuracy on the downstream task. Our method outperforms the random baseline and achieves state-of-the-art performance. More results can be found in Section \ref{sec:exp}.
  • Figure 2: Ranking correlation between the loss integral over the optimization process and the loss integral over the masking process. We fine-tune the pre-trained ResNet-18 on the five downstream datasets for 50 epochs and present the CXRB10 (Nodule) results here due to the space limit (see Figure \ref{fig:appendix-score} for full results). The easy (hard) sample corresponds to the data point with the minimum (maximum) loss integral over the optimization. We use min-max normalization to scale the loss to the range [0, 1]. (a): Loss trends with the number of parameters, i.e., model capacity. We produce models with different capacities by masking {2%, ..., 98%} of the weights of the pre-trained ResNet-18. The average category representation acts as the prototype for classification and for the loss integral. (b): Loss trends with the optimization time. (c): High ranking correlation, with coefficient $\rho = \mathbf{0.54}$.
  • Figure 3: Samples distinguished by DLC. We randomly select five image pairs with the highest and lowest DLC scores across all downstream datasets, labeled as Dataset (Category). Qualitatively, easy samples with the lowest DLC scores show a complete and clear structure for classification.
  • Figure 4: Harmful distribution shift from the top-K under-sampling strategies. We conduct downstream dataset pruning with four under-sampling strategies according to the optimization-based learning complexity and DLC. (a): Distribution shift comparison of four subsets for optimization-based learning complexity. (b): Downstream performance comparison of four subsets for optimization-based learning complexity. (c): Distribution shift comparison of four subsets for DLC. (d): Downstream performance comparison of four subsets for DLC.
  • Figure 5: Trends of average accuracy gap over the random method at different pruning ratios. Detailed results can be found in Appendix \ref{sec:app-main}.
  • ...and 2 more figures
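
The Figure 2 caption mentions min-max normalization of losses and a ranking correlation coefficient between two hardness scores. A minimal sketch of both computations follows. The helper names `min_max` and `spearman_rho` and the toy score values are hypothetical, and the sketch assumes a Spearman-style rank correlation without ties; the caption does not state which coefficient the paper uses.

```python
import numpy as np

def min_max(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    v = np.asarray(values, dtype=float)
    span = v.max() - v.min()
    return (v - v.min()) / span if span > 0 else np.zeros_like(v)

def spearman_rho(a, b):
    """Pearson correlation of ranks; assumes no tied values in `a` or `b`."""
    rank_a = np.argsort(np.argsort(a)).astype(float)
    rank_b = np.argsort(np.argsort(b)).astype(float)
    return float(np.corrcoef(rank_a, rank_b)[0, 1])

# Hypothetical per-sample loss integrals from two procedures (illustrative values only).
optimization_integral = np.array([0.2, 1.5, 0.9, 3.0, 0.4])
masking_integral = np.array([0.1, 0.8, 1.1, 2.0, 0.3])

rho = spearman_rho(optimization_integral, masking_integral)
rho_normalized = spearman_rho(min_max(optimization_integral), min_max(masking_integral))
```

Because min-max scaling is monotone, it leaves the ranks unchanged, so `rho` and `rho_normalized` agree; the normalization affects only how the loss curves are plotted, not the rank correlation.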

Theorems & Definitions (2)

  • Definition 1: Learning Path
  • Definition 2: Learning Complexity