AdaGrad stepsizes: Sharp convergence over nonconvex landscapes

Rachel Ward, Xiaoxia Wu, Leon Bottou

TL;DR

This work provides theoretical guarantees for AdaGrad-Norm in smooth nonconvex optimization, showing that it converges to stationary points at a rate of O(log(N)/sqrt(N)) in the stochastic setting and O(1/N) in the batch setting, while remaining robust to hyperparameters such as the initial scale b_0 and the global stepsize eta. The analysis handles the coupling between the adaptive scaling b_j and the stochastic gradients, and it uses the Descent Lemma together with log-sum arguments to derive explicit constants. In practice, AdaGrad-Norm reduces the need to tune the stepsize to the Lipschitz smoothness constant or the gradient noise level, and numerical experiments on synthetic data and deep learning models indicate strong robustness and competitive performance without heavy hyperparameter tuning. The results offer a theoretical foundation for the empirical success of adaptive gradient methods in nonconvex optimization and suggest avenues for extending the framework to other adaptive schemes.

Abstract

Adaptive gradient methods such as AdaGrad and its variants update the stepsize in stochastic gradient descent on the fly according to the gradients received along the way; such methods have gained widespread use in large-scale optimization for their ability to converge robustly, without the need to fine-tune the stepsize schedule. Yet, the theoretical guarantees to date for AdaGrad are for online and convex optimization. We bridge this gap by providing theoretical guarantees for the convergence of AdaGrad for smooth, nonconvex functions. We show that the norm version of AdaGrad (AdaGrad-Norm) converges to a stationary point at the $\mathcal{O}(\log(N)/\sqrt{N})$ rate in the stochastic setting, and at the optimal $\mathcal{O}(1/N)$ rate in the batch (non-stochastic) setting -- in this sense, our convergence guarantees are 'sharp'. In particular, the convergence of AdaGrad-Norm is robust to the choice of all hyper-parameters of the algorithm, in contrast to stochastic gradient descent whose convergence depends crucially on tuning the step-size to the (generally unknown) Lipschitz smoothness constant and level of stochastic noise on the gradient. Extensive numerical experiments are provided to corroborate our theory; moreover, the experiments suggest that the robustness of AdaGrad-Norm extends to state-of-the-art models in deep learning, without sacrificing generalization.
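
To make the "stepsize updated on the fly" idea concrete, here is a minimal sketch of AdaGrad-Norm in the batch setting. It assumes the usual AdaGrad-Norm recursion, $b_{j+1}^2 = b_j^2 + \|G_j\|^2$ followed by $x_{j+1} = x_j - (\eta / b_{j+1})\, G_j$, which this summary does not spell out. The function name `adagrad_norm`, the least-squares test problem, and all constants are illustrative choices, not the authors' code or experimental setup.

```python
import numpy as np


def adagrad_norm(grad_fn, x0, eta=1.0, b0=0.1, num_steps=1000):
    """Sketch of AdaGrad-Norm: one scalar accumulator b shared by all coordinates.

    Assumes b0 > 0 so the effective stepsize eta / b is always finite.
    """
    x = np.asarray(x0, dtype=float).copy()
    b_sq = b0 ** 2
    for _ in range(num_steps):
        g = grad_fn(x)
        # Accumulate the squared gradient norm: b_{j+1}^2 = b_j^2 + ||G_j||^2
        b_sq += float(np.dot(g, g))
        # Step with the effective learning rate eta / b_{j+1}
        x -= (eta / np.sqrt(b_sq)) * g
    return x, np.sqrt(b_sq)


# Hypothetical test problem: noiseless least squares with Gaussian data.
rng = np.random.default_rng(0)
m, d = 200, 10
A = rng.standard_normal((m, d))
y = A @ rng.standard_normal(d)


def full_grad(x):
    # Gradient of F(x) = ||Ax - y||^2 / (2m)
    return A.T @ (A @ x - y) / m


results = {}
for b0 in (1e-2, 1.0, 1e2):
    x_hat, b_final = adagrad_norm(full_grad, np.zeros(d), eta=1.0, b0=b0, num_steps=2000)
    results[b0] = float(np.linalg.norm(full_grad(x_hat)))
    print(f"b0={b0:g}  final grad norm={results[b0]:.3e}  final b={b_final:.3f}")
```

The same `eta` is used for initial scales `b0` spanning four orders of magnitude; no Lipschitz constant is supplied anywhere, which is the kind of robustness the abstract describes. A stochastic variant would only change `grad_fn` to return a noisy gradient estimate.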

Paper Structure

This paper contains 23 sections, 7 theorems, 57 equations, 7 figures, 2 tables, 3 algorithms.

Key Result

Theorem 2.1

Suppose $F \in \mathbb{C}_L^1$ and $F^{*} = \inf_{x}F(x)>-\infty$. Suppose that the random variables $G_{\ell}, \ell \geq 0$, satisfy the above assumptions. Then with probability $1- \delta$, where

Figures (7)

  • Figure 1: Gaussian Data -- Stochastic Setting. The top 3 figures plot the square of the gradient norm for linear regression, $\|A^{T}\left(Ax_j - y\right) \| /m$, w.r.t. $b_0$, at iterations 10, 2000 and 5000 respectively (see title). The bottom 3 figures plot the corresponding effective learning rates (median of $\{b_j(\ell)\}_{\ell=1}^d$ for AdaGrad-Coordinate), w.r.t. $b_0$, at iterations 10, 2000 and 5000 respectively (see title).
  • Figure 2: Gaussian Data -- Batch Setting. The y-axis and x-axis in the top and middle 3 figures are the same as in Figure 1. The bottom 3 figures plot the accumulated computational time (AccuTime) up to iterations $50$, $100$ and $200$ (see title), as a function of $b_0$.
  • Figure 3: MNIST. In each plot, the y-axis is the train or test accuracy and the x-axis is $b_0$. The 6 plots are for logistic regression (LogReg), averaged over epochs 1-5, 11-15 and 26-30. The title is the last epoch of the average. Note that the green and red curves overlap when $b_0$ belongs to $[10, \infty)$.
  • Figure 4: In each plot, the y-axis is the train or test accuracy and the x-axis is $b_0$. The top-left 6 plots are for MNIST using a two-layer fully connected network (ReLU activation). The top-right 6 plots are for MNIST using a convolutional neural network (CNN). The bottom-left 6 plots are for CIFAR10 using ResNet-18 with the learnable parameters in Batch-Norm disabled. The bottom-right 6 plots are for CIFAR10 using ResNet-18 with default Batch-Norm. The points in the top (bottom) plots are averages over epochs 1-5 (6-10), 11-15 (41-45) or 26-30 (86-90). The title is the last epoch of the average. Note that the green, red and black curves overlap when $b_0$ belongs to $[10, \infty)$. Best viewed on screen.
  • Figure 5: ImageNet trained with ResNet-50. The y-axis is the average train or test accuracy over epochs 26-30, 46-50 and 86-90, w.r.t. $b_0^2$. No momentum is used in training; see Experimental Details. Note that the green, red and black curves overlap when $b_0$ belongs to $[10, \infty)$.
  • ...and 2 more figures

Theorems & Definitions (7)

  • Theorem 2.1: AdaGrad-Norm: convergence in stochastic setting
  • Theorem 2.2: AdaGrad-Norm: convergence in deterministic setting
  • Lemma 2.1
  • Lemma 3.1: Descent Lemma
  • Lemma 3.2
  • Lemma 4.1
  • Lemma 4.2