On the SDEs and Scaling Rules for Adaptive Gradient Algorithms
Sadhika Malladi, Kaifeng Lyu, Abhishek Panigrahi, Sanjeev Arora
TL;DR
This work provides rigorously proven Itô SDE approximations for RMSprop and Adam, establishing that these adaptive optimizers admit 1st-order weak approximations and enabling square-root scaling rules when batch size changes. By introducing the Noisy Gradient Oracle with Scale Parameter and carefully bounding approximation errors under well-behaved noise and polynomial-growth conditions, the authors show how to adjust hyperparameters (e.g., $\
Abstract
Approximating Stochastic Gradient Descent (SGD) as a Stochastic Differential Equation (SDE) has allowed researchers to enjoy the benefits of studying a continuous optimization trajectory while carefully preserving the stochasticity of SGD. Analogous study of adaptive gradient methods, such as RMSprop and Adam, has been challenging because there were no rigorously proven SDE approximations for these methods. This paper derives the SDE approximations for RMSprop and Adam, giving theoretical guarantees of their correctness as well as experimental validation of their applicability to common large-scaling vision and language settings. A key practical result is the derivation of a $\textit{square root scaling rule}$ to adjust the optimization hyperparameters of RMSprop and Adam when changing batch size, and its empirical validation in deep learning settings.
