Progressive Data Dropout: An Embarrassingly Simple Approach to Faster Training

Shriram M Sathiyanarayanan, Xinyue Hao, Shihao Hou, Yang Lu, Laura Sevilla-Lara, Anurag Arnab, Shreyank N Gowda

TL;DR

Progressive Data Dropout (PDD) proposes progressively dropping training samples across epochs to accelerate neural network training without altering model architecture or optimization. It presents three variants—Difficulty-First Training (DBPD), Scalar Random Dropout (SRD), and Schedule-Matched Random Dropout (SMRD)—with SMRD often delivering the best accuracy, while all variants substantially reduce effective epochs (EE). Empirical results span supervised image classification on CIFAR-10/100 and ImageNet, self-supervised MAE pretraining, and NLP generalization, showing up to 4.82% accuracy gains and up to an order-of-magnitude EE reduction (e.g., from 800 to 50 in MAE pretraining). The approach is simple to integrate, adaptable across architectures, and supported by mathematical approximations for SMRD schedules to facilitate practical deployment, offering a data-centric path to faster, greener training.

Abstract

The success of the machine learning field has reliably depended on training on large datasets. While effective, this trend comes at an extraordinary cost. This is due to two deeply intertwined factors: the size of models and the size of datasets. While promising research efforts focus on reducing the size of models, the other half of the equation remains fairly mysterious. Indeed, it is surprising that the standard approach to training remains to iterate over and over, uniformly sampling the training dataset. In this paper we explore a series of alternative training paradigms that leverage insights from hard-data-mining and dropout, and that are simple enough to implement and use that they could become the new training standard. The proposed Progressive Data Dropout reduces the number of effective epochs to as little as 12.4% of the baseline. These savings do not come at any cost to accuracy. Surprisingly, the proposed method improves accuracy by up to 4.82%. Our approach requires no changes to model architecture or optimizer, and can be applied across standard training pipelines, thus posing an excellent opportunity for wide adoption. Code can be found here: https://github.com/bazyagami/LearningWithRevision

Paper Structure

This paper contains 35 sections, 11 equations, 4 figures, 25 tables, 1 algorithm.

Figures (4)

  • Figure 1: Using the proposed variants of Progressive Data Dropout, we can not only outperform the traditional baseline but also significantly reduce the number of effective epochs. This finding is consistent across a range of models and datasets.
  • Figure 2: A possible reasoning for the effectiveness of our approach is that random dropout maintains representative coverage and introduces beneficial stochasticity. Here, we see an example of the number of times each sample goes through backpropagation. Traditionally, each sample would go through this 'total epochs' number of times (red line).
  • Figure 3: We evaluate EfficientNet and MobileNet by adding more than one epoch's worth of revision. We see that after an initial boost, there is a drop in performance.
  • Figure 4: Here, we notice an exponential decay in the number of samples being picked at each epoch for backpropagation. The last epoch includes all samples again. The lighter plot denotes threshold 0.7 while the darker one represents 0.3.
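
The Figure 4 caption describes a per-epoch sample count that decays exponentially, followed by a last epoch that uses all samples again. The sketch below illustrates that shape for a random-dropout schedule. It is not the authors' implementation: the function names, the `keep_ratio` parameter, and the definition of effective epochs used here (total samples sent through backpropagation divided by the dataset size) are assumptions made for illustration only.

```python
import random


def random_dropout_schedule(num_samples, num_epochs, keep_ratio):
    """Hypothetical schedule: subset size shrinks geometrically each epoch,
    then the final epoch revisits the full dataset."""
    sizes = [
        max(1, int(num_samples * keep_ratio ** epoch))
        for epoch in range(num_epochs - 1)
    ]
    sizes.append(num_samples)  # last epoch includes all samples again
    return sizes


def sample_epoch_indices(num_samples, subset_size, rng):
    """Draw a uniformly random subset of sample indices for one epoch."""
    return rng.sample(range(num_samples), subset_size)


def effective_epochs(sizes, num_samples):
    """Assumed definition: total backpropagated samples / dataset size."""
    return sum(sizes) / num_samples


if __name__ == "__main__":
    rng = random.Random(0)
    sizes = random_dropout_schedule(num_samples=1000, num_epochs=5, keep_ratio=0.5)
    for epoch, size in enumerate(sizes):
        indices = sample_epoch_indices(1000, size, rng)
        # A real training loop would run forward/backward passes on `indices` here.
        print(f"epoch {epoch}: training on {len(indices)} samples")
    print("effective epochs:", effective_epochs(sizes, 1000))
```

Under these assumptions, five nominal epochs with `keep_ratio=0.5` touch 2875 samples in total, which is 2.875 effective epochs rather than 5. In a PyTorch pipeline, the per-epoch index list could plausibly be passed to a sampler such as `torch.utils.data.SubsetRandomSampler`, leaving the model and optimizer untouched.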