EfficientTrain++: Generalized Curriculum Learning for Efficient Visual Backbone Training

Yulin Wang, Yang Yue, Rui Lu, Yizeng Han, Shiji Song, Gao Huang

TL;DR

EfficientTrain++ generalizes curriculum learning by applying an epoch-dependent, soft transformation that reveals easier-to-learn patterns within each input, notably low-frequency content, while preserving access to all data. The method leverages frequency-domain cropping and controlled augmentation to form a unified curriculum, optimized by a computationally constrained search to minimize training cost. Across supervised and self-supervised settings on ImageNet-1K/22K, EfficientTrain++ delivers consistent 1.5×–3× training-time speedups with competitive or improved accuracy, and it transfers effectively to downstream tasks. The approach is simple, model-agnostic, and compatible with sample-selection methods, offering substantial practical impact for accelerating large-scale visual backbone training and pre-training.

Abstract

The superior performance of modern visual backbones usually comes with a costly training procedure. We contribute to this issue by generalizing the idea of curriculum learning beyond its original formulation, i.e., training models using easier-to-harder data. Specifically, we reformulate the training curriculum as a soft-selection function, which uncovers progressively more difficult patterns within each example during training, instead of performing easier-to-harder sample selection. Our work is inspired by an intriguing observation on the learning dynamics of visual backbones: during the earlier stages of training, the model predominantly learns to recognize some 'easier-to-learn' discriminative patterns in the data. These patterns, when observed through frequency and spatial domains, incorporate lower-frequency components, and the natural image contents without distortion or data augmentation. Motivated by these findings, we propose a curriculum where the model always leverages all the training data at every learning stage, yet the exposure to the 'easier-to-learn' patterns of each example is initiated first, with harder patterns gradually introduced as training progresses. To implement this idea in a computationally efficient way, we introduce a cropping operation in the Fourier spectrum of the inputs, enabling the model to learn from only the lower-frequency components. Then we show that exposing the contents of natural images can be readily achieved by modulating the intensity of data augmentation. Finally, we integrate these aspects and design curriculum schedules with tailored search algorithms. The resulting method, EfficientTrain++, is simple, general, yet surprisingly effective. It reduces the training time of a wide variety of popular models by 1.5-3.0x on ImageNet-1K/22K without sacrificing accuracy. It also demonstrates efficacy in self-supervised learning (e.g., MAE).
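
The frequency-domain cropping described in the abstract can be illustrated with a minimal NumPy sketch. It assumes a centred square window of side `bandwidth` is kept from the 2D Fourier spectrum and then inverted, so the output is a smaller image containing only the lower-frequency components. The function name `low_freq_crop`, the intensity rescaling, and the handling of even window sizes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def low_freq_crop(image: np.ndarray, bandwidth: int) -> np.ndarray:
    """Keep the central bandwidth x bandwidth window of the centred 2D
    spectrum and invert it (hypothetical helper, not the paper's code).

    `image` has shape (..., H, W); the result has shape
    (..., bandwidth, bandwidth).
    """
    height, width = image.shape[-2:]
    if not 0 < bandwidth <= min(height, width):
        raise ValueError("bandwidth must be in (0, min(H, W)]")

    # Move the zero-frequency (DC) term to index (H // 2, W // 2).
    spectrum = np.fft.fftshift(np.fft.fft2(image), axes=(-2, -1))

    # Crop a window whose own centre index (bandwidth // 2) lands on DC.
    top = height // 2 - bandwidth // 2
    left = width // 2 - bandwidth // 2
    cropped = spectrum[..., top:top + bandwidth, left:left + bandwidth]

    # Undo the shift so DC is back at index (0, 0) before inverting.
    cropped = np.fft.ifftshift(cropped, axes=(-2, -1))

    # fft2 is unnormalised and ifft2 divides by bandwidth**2, so rescale
    # to keep the mean pixel intensity unchanged (an assumed convention).
    scale = (bandwidth * bandwidth) / (height * width)

    # For even bandwidths the unpaired Nyquist bin can leave a small
    # imaginary residue; only the real part is returned.
    return np.real(np.fft.ifft2(cropped)) * scale


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch = rng.random((2, 3, 224, 224))   # (N, C, H, W)
    small = low_freq_crop(batch, 160)
    print(small.shape)                     # (2, 3, 160, 160)
```

Because the cropped input has fewer pixels, the forward and backward passes are cheaper, which is consistent with the abstract's claim that this operation is computationally efficient. A training curriculum would then enlarge `bandwidth` as training progresses; the particular schedule is found by search in the paper and is not reproduced here.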

Paper Structure

This paper contains 39 sections, 1 theorem, 25 equations, 12 figures, 25 tables, 2 algorithms.

Key Result

Proposition 1

Suppose that $\boldsymbol{X}_{\textnormal{c}} = \mathcal{F}^{-1} \circ \mathcal{C}_{B, B} \circ \mathcal{F}(\boldsymbol{X})$, and that $\boldsymbol{X}_{\textnormal{d}} = \mathcal{D}_{B, B}(\boldsymbol{X})$, where $B \times B$ down-sampling $\mathcal{D}_{B, B}(\cdot)$ is realized by a common in

Figures (12)

  • Figure 1: (a) Sample-wise curriculum learning (CL): making a discrete decision on whether each example should be leveraged to train the model. (b) Generalized CL: we consider a continuous function $\mathcal{T}_t(\cdot)$, which only exposes the 'easier-to-learn' patterns within each example at the beginning of training (e.g., lower-frequency components; see Section \ref{sec:EfficientTrain_sec4}), while gradually introducing relatively more difficult patterns as learning progresses.
  • Figure 2: Low-pass filtering. Following \cite{wang2020high}, we adopt a circular filter.
  • Figure 3: Ablation study results with low-pass filtering ($r$: bandwidth of the filter; see Figure \ref{fig:low_pass_filtering} for details). We ablate the higher-frequency components of the inputs for a DeiT-Small \cite{touvron2021training}, and present the curves of validation accuracy vs. training epochs on ImageNet-1K. We highlight the separation points of the curves with black boxes.
  • Figure 4: Performing low-pass filtering only on the validation inputs (other setups are the same as Figure \ref{fig:low_pass_training}). We train a model using the original images without any filtering (i.e., containing both lower- and higher-frequency components), and evaluate all the intermediate checkpoints on low-pass filtered validation sets with varying bandwidths.
  • Figure 5: Low-frequency cropping in the frequency domain (${B}^2$: bandwidth).
  • ...and 7 more figures
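
The circular low-pass filter mentioned in the captions of Figures 2 and 3 can be sketched as follows. This is a minimal NumPy illustration, assuming the filter zeroes every Fourier coefficient whose distance from the zero-frequency term exceeds a radius; the function name `circular_low_pass` and the choice to measure `radius` in frequency bins are assumptions, and the mapping to the captions' $r$ is not confirmed by the text above.

```python
import numpy as np


def circular_low_pass(image: np.ndarray, radius: float) -> np.ndarray:
    """Zero all frequencies farther than `radius` bins from DC
    (hypothetical helper; the output keeps the input resolution)."""
    height, width = image.shape[-2:]

    # Centre the spectrum so DC sits at index (H // 2, W // 2).
    spectrum = np.fft.fftshift(np.fft.fft2(image), axes=(-2, -1))

    # Distance of every frequency bin from the DC bin.
    rows = np.arange(height) - height // 2
    cols = np.arange(width) - width // 2
    distance = np.sqrt(rows[:, None] ** 2 + cols[None, :] ** 2)

    # Boolean disc: True inside the pass band, False outside.
    mask = distance <= radius

    # Apply the mask, undo the shift, and return to the spatial domain.
    filtered = np.fft.ifftshift(spectrum * mask, axes=(-2, -1))
    return np.real(np.fft.ifft2(filtered))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = rng.random((3, 64, 64))        # (C, H, W)
    blurred = circular_low_pass(image, 10)
    print(blurred.shape)                   # (3, 64, 64)
```

Unlike cropping the spectrum, this filter leaves the image size unchanged, so it removes higher-frequency content without reducing the number of pixels the model processes.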

Theorems & Definitions (1)

  • Proposition 1