Stochastic Training is Not Necessary for Generalization
Jonas Geiping, Micah Goldblum, Phillip E. Pope, Michael Moeller, Tom Goldstein
TL;DR
The paper challenges the widely held view that SGD's implicit regularization is essential for neural network generalization. It demonstrates that non-stochastic full-batch training, when paired with data augmentation and explicit regularization such as gradient-penalty terms and gradient clipping, can match or exceed SGD performance on CIFAR-10 with ResNet-18. It further shows that a fully non-stochastic setting without data augmentation can still achieve over 95% accuracy, provided hyperparameters are carefully tuned and the training regime is extended, though at great computational cost. These findings imply that gradient noise from mini-batching is not strictly necessary for generalization and that explicit regularization can replicate its beneficial effects, prompting a re-evaluation of theories tied solely to stochastic optimization dynamics.
Abstract
It is widely believed that the implicit regularization of SGD is fundamental to the impressive generalization behavior we observe in neural networks. In this work, we demonstrate that non-stochastic full-batch training can achieve comparably strong performance to SGD on CIFAR-10 using modern architectures. To this end, we show that the implicit regularization of SGD can be completely replaced with explicit regularization even when comparing against a strong and well-researched baseline. Our observations indicate that the perceived difficulty of full-batch training may be the result of its optimization properties and the disproportionate time and effort spent by the ML community tuning optimizers and hyperparameters for small-batch training.
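To make the recipe concrete, the following is a minimal sketch of full-batch gradient descent combined with an explicit gradient-norm penalty and gradient clipping. It runs on a toy least-squares problem, not on the paper's CIFAR-10 and ResNet-18 setup. The penalty form `(alpha / 2) * ||grad L(w)||^2`, the function names, and all hyperparameter values are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def loss_and_grad(w, X, y):
    """Half mean squared error over the full dataset and its exact gradient."""
    n = X.shape[0]
    residual = X @ w - y
    loss = 0.5 * np.mean(residual ** 2)
    grad = X.T @ residual / n
    return loss, grad

def penalized_grad(w, X, y, alpha):
    """Gradient of L(w) + (alpha / 2) * ||grad L(w)||^2 (assumed penalty form).

    For least squares the Hessian is H = X^T X / n, so the penalty
    contributes alpha * H @ grad L(w) to the total gradient.
    """
    n = X.shape[0]
    _, grad = loss_and_grad(w, X, y)
    hessian = X.T @ X / n
    return grad + alpha * hessian @ grad

def clip_by_norm(g, max_norm):
    """Rescale g so that its Euclidean norm does not exceed max_norm."""
    norm = np.linalg.norm(g)
    if norm > max_norm:
        return g * (max_norm / norm)
    return g

def full_batch_gd(X, y, steps=500, lr=0.1, alpha=0.01, max_norm=1.0):
    """Deterministic training loop: every step uses the entire dataset."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        g = penalized_grad(w, X, y, alpha)  # explicit regularization
        g = clip_by_norm(g, max_norm)       # stabilizes the large early steps
        w = w - lr * g
    return w

# Toy data (hypothetical): noiseless linear targets with known weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
w_true = np.arange(1.0, 6.0)
y = X @ w_true

w_hat = full_batch_gd(X, y)
print(np.round(w_hat, 3))
```

No mini-batch is ever sampled here, so the only regularization comes from the explicit penalty term, which is the substitution the abstract describes. For a neural network the Hessian-vector product would not be formed explicitly. It would typically be computed with automatic differentiation or approximated by finite differences.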
