Scale Efficient Training for Large Datasets
Qing Zhou, Junyu Gao, Qi Wang
TL;DR
The paper tackles the inefficiency of training on large-scale datasets caused by low-value samples and proposes Scale Efficient Training (SeTa), a dynamic sample-pruning framework. SeTa uses loss-guided clustering on a down-sampled subset to create difficulty-stratified groups and a sliding-window curriculum that progressively moves from easier to harder clusters, with partial annealing in later epochs to stabilize training. Empirical results on synthetic datasets (ToCa, SS1M, ST+MJ) and real datasets, covering diverse architectures and tasks, show training-time reductions of up to 50% while preserving or improving performance. The approach is model-agnostic and can be integrated into existing training pipelines.
Abstract
The rapid growth of dataset scales has been a key driver in advancing deep learning research. However, as dataset scale increases, the training process becomes increasingly inefficient due to the presence of low-value samples, including excessive redundant samples, overly challenging samples, and inefficient easy samples that contribute little to model improvement. To address this challenge, we propose Scale Efficient Training (SeTa) for large datasets, a dynamic sample pruning approach that losslessly reduces training time. To remove low-value samples, SeTa first performs random pruning to eliminate redundant samples, then clusters the remaining samples according to their learning difficulty measured by loss. Building upon this clustering, a sliding window strategy is employed to progressively remove both overly challenging and inefficient easy clusters following an easy-to-hard curriculum. We conduct extensive experiments on large-scale synthetic datasets, including ToCa, SS1M, and ST+MJ, each containing over 3 million samples. SeTa reduces training costs by up to 50% while maintaining or improving performance, with minimal degradation even at 70% cost reduction. Furthermore, experiments on real datasets of various scales, across different backbones (CNNs, Transformers, and Mambas) and diverse tasks (instruction tuning, multi-view stereo, geo-localization, composed image retrieval, referring image segmentation), demonstrate the effectiveness and universality of our approach. Code is available at https://github.com/mrazhou/SeTa.
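The three steps described in the abstract (random pruning, grouping by loss, and an easy-to-hard sliding window) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `select_indices`, its parameters, the use of equal-size loss quantiles in place of the paper's clustering, and the cyclic window schedule are all assumptions made for illustration, and the late-epoch annealing is omitted.

```python
import numpy as np


def select_indices(losses, epoch, keep_ratio=0.5, num_groups=5, window=3, seed=0):
    """Pick sample indices for one epoch (hypothetical sketch, not the official code).

    losses: per-sample loss values recorded so far, shape (n,).
    """
    losses = np.asarray(losses, dtype=float)
    n = len(losses)
    rng = np.random.default_rng(seed + epoch)

    # Step 1: random pruning to thin out redundant samples.
    kept = rng.choice(n, size=max(1, int(n * keep_ratio)), replace=False)

    # Step 2: group the remaining samples by difficulty.
    # Assumption: equal-size loss quantiles stand in for the paper's clustering.
    ordered = kept[np.argsort(losses[kept])]      # easiest first
    groups = np.array_split(ordered, num_groups)  # groups[0] is the easiest

    # Step 3: a window of `window` adjacent groups slides from easy to hard.
    # Assumption: the window advances one group per epoch and then wraps around.
    num_positions = num_groups - window + 1
    start = epoch % num_positions
    return np.concatenate(groups[start:start + window])
```

In a training loop, the returned indices would feed a subset sampler for that epoch, and `losses` would be refreshed from the most recent forward passes. With the default values, each epoch trains on roughly 30% of the data (half kept by random pruning, then three of five groups); the real method's ratios and schedule may differ.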
