Stochastic Training is Not Necessary for Generalization

Jonas Geiping, Micah Goldblum, Phillip E. Pope, Michael Moeller, Tom Goldstein

TL;DR

The paper challenges the widely held view that SGD's implicit regularization is essential for neural network generalization. It demonstrates that non-stochastic full-batch training, when paired with data augmentation and explicit regularization such as gradient-penalty terms and gradient clipping, can match or exceed SGD performance on CIFAR-10 with ResNet-18. It further shows that a fully non-stochastic setting without data augmentation can still achieve over 95% accuracy, provided hyperparameters are carefully tuned and the training regime is extended, though at great computational cost. These findings imply that gradient noise from mini-batching is not strictly necessary for generalization and that explicit regularization can replicate its beneficial effects, prompting a re-evaluation of theories tied solely to stochastic optimization dynamics.

Abstract

It is widely believed that the implicit regularization of SGD is fundamental to the impressive generalization behavior we observe in neural networks. In this work, we demonstrate that non-stochastic full-batch training can achieve comparably strong performance to SGD on CIFAR-10 using modern architectures. To this end, we show that the implicit regularization of SGD can be completely replaced with explicit regularization even when comparing against a strong and well-researched baseline. Our observations indicate that the perceived difficulty of full-batch training may be the result of its optimization properties and the disproportionate time and effort spent by the ML community tuning optimizers and hyperparameters for small-batch training.
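The two explicit ingredients named in the TL;DR, a gradient penalty and gradient clipping, can be illustrated on a toy problem. The following is a minimal sketch, not the authors' implementation: it assumes a simplified penalty of the form L(w) + (alpha/2)·||∇L(w)||², a tiny linear-regression loss in place of ResNet-18 on CIFAR-10, and a forward-difference approximation of the Hessian-vector product. All names (`loss_grad`, `penalized_grad`, `full_batch_gd`) and hyperparameter values are hypothetical.

```python
import math

def loss_grad(w, data):
    """Full-batch gradient of the loss (1/2n) * sum((w0*x + w1 - y)^2)."""
    n = len(data)
    g0 = sum((w[0] * x + w[1] - y) * x for x, y in data) / n
    g1 = sum((w[0] * x + w[1] - y) for x, y in data) / n
    return [g0, g1]

def norm(v):
    return math.sqrt(sum(c * c for c in v))

def clip(g, max_norm):
    """Rescale g so that its Euclidean norm is at most max_norm."""
    n = norm(g)
    if n > max_norm:
        return [c * max_norm / n for c in g]
    return g

def penalized_grad(w, data, alpha, eps=1e-3):
    """Gradient of L(w) + (alpha/2) * ||grad L(w)||^2 (assumed penalty form).

    The penalty contributes alpha * H g, where H is the Hessian and g the
    gradient. H g is approximated by a forward difference of the gradient
    along g, so no second derivatives are computed explicitly.
    """
    g = loss_grad(w, data)
    w_shift = [wi + eps * gi for wi, gi in zip(w, g)]
    g_shift = loss_grad(w_shift, data)
    hvp = [(gs - gi) / eps for gs, gi in zip(g_shift, g)]
    return [gi + alpha * hi for gi, hi in zip(g, hvp)]

def full_batch_gd(data, steps=500, lr=0.1, alpha=0.05, max_norm=1.0):
    """Deterministic gradient descent: every step uses the whole dataset."""
    w = [0.0, 0.0]
    for _ in range(steps):
        g = clip(penalized_grad(w, data, alpha), max_norm)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

# Noise-free toy data generated from y = 2x + 1 (illustrative only).
data = [(k / 10.0, 2.0 * (k / 10.0) + 1.0) for k in range(-10, 11)]
w = full_batch_gd(data)
```

No step in this loop draws a random sample, so the whole procedure is deterministic given the data and the initial weights. On this convex toy problem the penalty only changes the speed of convergence; any effect on generalization of the kind the paper reports would have to be checked on a real network.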

Paper Structure

This paper contains 33 sections, 12 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: One-dimensional loss landscapes (random direction) of models trained with gradient descent. Default full-batch gradient descent (left) produces sharp models that neither train nor generalize well, yet it can be modified to converge to flatter minima with longer training, gradient clipping and appropriate regularization (right).
  • Figure 2: Cross-entropy loss on the training and validation set and full loss (including weight decay) during training for full-batch gradient descent. Left: training as described in \ref{sec:trainlonger} without clipping; right: with gradient clipping. Clipped steps are marked in black. Validation computed every 100 steps.
  • Figure 3: Cross-entropy loss on the training and validation set and full loss (including weight decay) during training for full-batch gradient descent. Clipped steps are marked in black. Validation computed every 100 steps. Training with gradient regularization: "FB regularized" on the left and "FB strong reg." on the right, both with a learning rate of 0.4.
  • Figure 4: One-dimensional loss landscape visualizations (random direction) of models trained with gradient descent, going from SGD (left) to GD with successive modifications (right). Whereas the models trained with unmodified gradient descent (middle) are noticeably sharper than the model trained with stochastic gradient descent (left), the final model trained with modified gradient descent (right) replicates the qualitative properties of the SGD model.
  • Figure 5: Cross-entropy loss on the training and validation set and full loss (including weight decay) during training for full-batch gradient descent. Clipped steps are marked in black. Validation computed every 100 steps. Top row: training as described in \ref{sec:trainlonger} with and without clipping. All other rows: training with gradient regularization, "FB regularized" on the left and "FB strong reg." on the right. Second row: training with lr=0.4. Third row: training with lr=0.8 (this is the final setting proposed in this work). Bottom row: training with lr=1.6.
  • ...and 1 more figure
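
The one-dimensional loss landscapes in Figures 1 and 4 evaluate the loss along a single random direction through the trained weights. The sketch below shows that idea on a toy linear-regression loss. It makes assumptions the captions do not state: the direction is simply normalized to unit length (the paper may use a different normalization), and the names `loss`, `random_unit_direction` and `loss_slice` are hypothetical.

```python
import math
import random

def loss(w, data):
    """Toy loss: (1/2n) * sum((w0*x + w1 - y)^2)."""
    n = len(data)
    return sum((w[0] * x + w[1] - y) ** 2 for x, y in data) / (2 * n)

def random_unit_direction(dim, rng):
    """Draw a Gaussian direction and scale it to unit Euclidean norm."""
    d = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    n = math.sqrt(sum(c * c for c in d))
    return [c / n for c in d]

def loss_slice(w, direction, data, radius=1.0, num=21):
    """Evaluate the loss at w + t * direction for t on a grid in [-radius, radius]."""
    ts = [-radius + 2.0 * radius * i / (num - 1) for i in range(num)]
    return [(t, loss([wi + t * di for wi, di in zip(w, direction)], data))
            for t in ts]

rng = random.Random(0)  # fixed seed so the slice is reproducible
data = [(k / 10.0, 2.0 * (k / 10.0) + 1.0) for k in range(-10, 11)]
w_trained = [2.0, 1.0]  # stands in for the weights of a trained model
direction = random_unit_direction(len(w_trained), rng)
curve = loss_slice(w_trained, direction, data)
```

Plotting the `(t, loss)` pairs in `curve` gives the kind of one-dimensional profile the captions describe. A sharper minimum shows up as a curve that rises more steeply as `t` moves away from 0.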

Theorems & Definitions (1)

  • Remark: The Practical View