AdaGrad stepsizes: Sharp convergence over nonconvex landscapes
Rachel Ward, Xiaoxia Wu, Leon Bottou
TL;DR
This work provides theoretical guarantees for AdaGrad-Norm in smooth nonconvex optimization, showing that it converges to stationary points at a rate of $\mathcal{O}(\log(N)/\sqrt{N})$ in the stochastic setting and $\mathcal{O}(1/N)$ in the batch setting, while remaining robust to hyperparameters such as the initial scale $b_0$ and the global stepsize $\eta$. The analysis handles the coupling between the adaptive scaling $b_j$ and the stochastic gradients, and it uses the Descent Lemma together with log-sum arguments to derive explicit constants. Practically, AdaGrad-Norm reduces the need to tune the stepsize to the Lipschitz smoothness constant or the gradient noise level, and numerical experiments on synthetic data and deep learning models indicate strong robustness and competitive performance without heavy hyperparameter tuning. The results offer a theoretical foundation for the empirical success of adaptive gradient methods in nonconvex optimization and suggest avenues for extending the framework to other adaptive schemes.
Abstract
Adaptive gradient methods such as AdaGrad and its variants update the stepsize in stochastic gradient descent on the fly according to the gradients received along the way; such methods have gained widespread use in large-scale optimization for their ability to converge robustly, without the need to fine-tune the stepsize schedule. Yet, the theoretical guarantees to date for AdaGrad are for online and convex optimization. We bridge this gap by providing theoretical guarantees for the convergence of AdaGrad for smooth, nonconvex functions. We show that the norm version of AdaGrad (AdaGrad-Norm) converges to a stationary point at the $\mathcal{O}(\log(N)/\sqrt{N})$ rate in the stochastic setting, and at the optimal $\mathcal{O}(1/N)$ rate in the batch (non-stochastic) setting; in this sense, our convergence guarantees are 'sharp'. In particular, the convergence of AdaGrad-Norm is robust to the choice of all hyperparameters of the algorithm, in contrast to stochastic gradient descent, whose convergence depends crucially on tuning the stepsize to the (generally unknown) Lipschitz smoothness constant and level of stochastic noise on the gradient. Extensive numerical experiments are provided to corroborate our theory; moreover, the experiments suggest that the robustness of AdaGrad-Norm extends to state-of-the-art models in deep learning, without sacrificing generalization.
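
To make the roles of the initial scale $b_0$ and the global stepsize $\eta$ concrete, here is a minimal sketch of a norm-version AdaGrad loop. It assumes the commonly stated AdaGrad-Norm update, $b_{j+1}^2 = b_j^2 + \|G_j\|^2$ followed by $x_{j+1} = x_j - (\eta / b_{j+1})\, G_j$, which this section does not spell out. The function name `adagrad_norm`, its default values, and the quadratic test objective are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def adagrad_norm(grad_fn, x0, eta=1.0, b0=0.1, num_steps=1000):
    """Sketch of AdaGrad-Norm: one scalar stepsize eta / b_j shared by all coordinates.

    grad_fn   -- callable returning a (possibly stochastic) gradient at x
    x0        -- starting point
    eta, b0   -- global stepsize and initial scale (hypothetical defaults)
    """
    x = np.asarray(x0, dtype=float).copy()
    b_sq = b0 ** 2  # running value of b_j^2
    for _ in range(num_steps):
        g = np.asarray(grad_fn(x), dtype=float)
        # Accumulate the squared gradient norm before taking the step.
        b_sq += float(np.dot(g, g))
        # Step with the adaptively scaled stepsize eta / b_{j+1}.
        x = x - (eta / np.sqrt(b_sq)) * g
    return x, np.sqrt(b_sq)


# Illustrative objective f(x) = 0.5 * ||x||^2, whose gradient is x itself.
x_final, b_final = adagrad_norm(lambda x: x, [3.0, -4.0])
print(np.linalg.norm(x_final), b_final)
```

Because $b_j$ only grows as gradients accumulate, the effective stepsize $\eta / b_j$ shrinks on its own, which is the mechanism behind the claimed robustness to the initial choice of $b_0$ and $\eta$. Neither the smoothness constant nor the noise level appears anywhere in the loop.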
