94% on CIFAR-10 in 3.29 Seconds on a Single GPU

Keller Jordan

TL;DR

The paper accelerates CIFAR-10 training on fixed hardware by introducing airbench, a suite of methods that combines patch-whitening and partial identity initializations, Lookahead optimization, scaled batch-normalization (BN) biases, alternating flip augmentation, and multi-crop evaluation, with optional PyTorch compilation. On a single NVIDIA A100 it reaches 94% accuracy in 3.29 seconds, 95% in 10.4 seconds, and 96% in 46.3 seconds, and the code is released for reproducibility. Key contributions include a derandomized alternating flip that avoids redundant flips, evidence that the speedups from individual features accumulate additively, and experiments on how far the method generalizes to CIFAR-100 and ImageNet settings under certain crop strategies. The practical payoff is cheaper, faster experimentation: rapid hyperparameter studies and large-scale training studies can reach statistical significance sooner and at lower computational cost.

Abstract

CIFAR-10 is among the most widely used datasets in machine learning, facilitating thousands of research projects per year. To accelerate research and reduce the cost of experiments, we introduce training methods for CIFAR-10 which reach 94% accuracy in 3.29 seconds, 95% in 10.4 seconds, and 96% in 46.3 seconds, when run on a single NVIDIA A100 GPU. As one factor contributing to these training speeds, we propose a derandomized variant of horizontal flipping augmentation, which we show improves over the standard method in every case where flipping is beneficial over no flipping at all. Our code is released at https://github.com/KellerJordan/cifar10-airbench.

Paper Structure (21 sections, 6 figures, 6 tables)

Introduction
Background
Methods
Network architecture and baseline training
Frozen patch-whitening initialization
Identity initialization
Optimization tricks
Multi-crop evaluation
Alternating flip
Compilation
95% and 96% targets
Experiments
Interaction between features
Does alternating flip generalize?
Variance and class-wise calibration
...and 6 more sections

Figures (6)

Figure 1: Alternating flip. In computer vision we typically train neural networks using random horizontal flipping augmentation, which flips each image with 50% probability per epoch. This results in some images being redundantly flipped the same way for many epochs in a row. We propose (see the Alternating flip section) to flip images in a deterministically alternating manner after the first epoch, avoiding this redundancy and speeding up training.
Figure 2: The first layer's weights after whitening initialization [hlbCIFAR10, paged2019resnet].
Figure 3: FLOPs vs. error rate tradeoff. Our three training methods apparently follow a linear log-log relationship between FLOPs and error rate.
Figure 4: Training speedups accumulate additively. Removing individual features from airbench94 increases the epochs-to-94%. Adding the same features to the whitened baseline training (see the Frozen patch-whitening initialization section) reduces the epochs-to-94%. For every feature except multi-crop TTA (see the Multi-crop evaluation section), these two changes in epochs-to-94% are roughly the same, suggesting that training speedups accumulate additively rather than multiplicatively.
Figure 5: Alternating flip boosts performance. Across a variety of settings for airbench94 and airbench96, the use of alternating flip rather than random flip consistently boosts performance by the equivalent of a 0-25% training speedup. The benefit generalizes to ImageNet trainings which use light augmentation other than flipping. 95% confidence intervals are shown around each point.
...and 1 more figures
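
The alternating flip described in the Figure 1 caption can be illustrated with a short sketch. This is not the airbench implementation: the function names `alternating_flip_mask` and `apply_flip`, the NumPy array layout `(N, C, H, W)`, and the toy data are all assumptions made for illustration. The only behavior taken from the caption is that flips are chosen at random with 50% probability in the first epoch and then alternate deterministically in later epochs.

```python
import numpy as np

def alternating_flip_mask(first_epoch_mask, epoch):
    """Boolean mask of images to flip in `epoch` (0-indexed, hypothetical helper).

    Epoch 0 uses the random choice; every later epoch inverts the
    previous epoch's choice, so each image alternates orientation.
    """
    return first_epoch_mask if epoch % 2 == 0 else ~first_epoch_mask

def apply_flip(images, mask):
    """Horizontally flip the images selected by `mask`.

    Assumes `images` has shape (N, C, H, W); the width axis is last.
    """
    out = images.copy()
    out[mask] = out[mask][..., ::-1]
    return out

rng = np.random.default_rng(0)
images = rng.random((8, 3, 4, 4)).astype(np.float32)  # toy stand-in for a dataset

# First epoch: ordinary random flipping with probability 0.5 per image.
first_epoch_mask = rng.random(len(images)) < 0.5

for epoch in range(4):
    mask = alternating_flip_mask(first_epoch_mask, epoch)
    augmented = apply_flip(images, mask)
    # ... a training step on `augmented` would go here ...
```

Under this scheme, any two consecutive epochs show every image in both orientations. With independent random flipping, by contrast, a given image keeps the same orientation across two consecutive epochs half of the time.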
