
Convergence Guarantees for RMSProp and Adam in Generalized-smooth Non-convex Optimization with Affine Noise Variance

Qi Zhang, Yi Zhou, Shaofeng Zou

Abstract

This paper provides the first tight convergence analyses for RMSProp and Adam in non-convex optimization under the most relaxed assumptions of coordinate-wise generalized smoothness and affine noise variance. We first analyze RMSProp, which is a special case of Adam with adaptive learning rates but without first-order momentum. Specifically, to address the challenges arising from the dependence among the adaptive update, the unbounded gradient estimate, and the Lipschitz constant, we demonstrate that the first-order term in the descent lemma converges and that its denominator is upper bounded by a function of the gradient norm. Based on this result, we show that RMSProp with proper hyperparameters converges to an $\epsilon$-stationary point with an iteration complexity of $\mathcal{O}(\epsilon^{-4})$. We then generalize our analysis to Adam, where the additional challenge is a mismatch between the gradient and the first-order momentum. We develop a new upper bound on the first-order term in the descent lemma, which is also a function of the gradient norm. We show that Adam with proper hyperparameters converges to an $\epsilon$-stationary point with an iteration complexity of $\mathcal{O}(\epsilon^{-4})$. Our complexity results for both RMSProp and Adam match the complexity lower bound established in \cite{arjevani2023lower}.
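The statement that RMSProp is Adam without first-order momentum can be illustrated with a minimal Python sketch. It assumes the common exponential-moving-average recursions for the moments, no bias correction, and the stepsize form $\frac{\eta}{\sqrt{\boldsymbol v_t+\zeta}}$ quoted in the figure captions; the function names and default values are hypothetical and are not taken from the paper's algorithm listing.

```python
import numpy as np

def adam_step(x, m, v, grad, eta=1e-3, beta1=0.9, beta2=0.999, zeta=1e-8):
    """One Adam-style update with per-coordinate stepsize eta / sqrt(v + zeta).

    Assumptions (not confirmed by the source): no bias correction and
    exponential-moving-average recursions for both moments.
    """
    m = beta1 * m + (1.0 - beta1) * grad       # first-order momentum
    v = beta2 * v + (1.0 - beta2) * grad**2    # second-moment estimate
    x = x - eta * m / np.sqrt(v + zeta)        # coordinate-wise adaptive step
    return x, m, v

def rmsprop_step(x, v, grad, eta=1e-3, beta2=0.999, zeta=1e-8):
    """One RMSProp-style update: same adaptive stepsize, raw gradient direction."""
    v = beta2 * v + (1.0 - beta2) * grad**2
    x = x - eta * grad / np.sqrt(v + zeta)
    return x, v

# Illustrative values only.
x0 = np.array([1.0, -2.0])
m0 = np.array([0.3, 0.1])
v0 = np.array([0.1, 0.2])
g = np.array([0.5, -0.25])

# With beta1 = 0 the momentum equals the current gradient,
# so the Adam-style step coincides with the RMSProp-style step.
x_adam, m_adam, v_adam = adam_step(x0, m0, v0, g, beta1=0.0)
x_rms, v_rms = rmsprop_step(x0, v0, g)
```

With `beta1 > 0` the update direction `m` generally differs from the current gradient, which is the mismatch between gradient and first-order momentum that the abstract names as the extra difficulty for Adam.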

Paper Structure

This paper contains 21 sections, 13 theorems, 109 equations, 4 figures, 1 table, 1 algorithm.

Key Result

Theorem 1

Let Assumptions \ref{assump:lowerbound}, \ref{assump:variance} and \ref{assump:generalsmooth} hold. Let $1-\beta_2 = \mathcal{O}(\epsilon^{2})$, $\eta = \mathcal{O}(\epsilon^{2})$, and $T = \mathcal{O}(\epsilon^{-4})$. For $\epsilon$ such that $\epsilon \le \frac{\sqrt{5dD_0}}{\sqrt{D_1}\sqrt[4]{\zeta}}$, we have that

Figures (4)

  • Figure 1: Test error for the original Adam and the modified Adam. The stepsize in the original Adam is set to $\frac{\eta}{\sqrt{\boldsymbol v_t}+\lambda}$ and our stepsize is set to $\frac{\eta}{\sqrt{\boldsymbol v_t+\zeta}}$. The parameters are the same as in the CNN task in Fig. 1 of \cite{li2023convergence}, where $\eta=0.001$, $\beta_1=0.9$, $\beta_2=0.999$, and we build a six-layer CNN for CIFAR-10.
  • Figure 2: Test accuracy for the original Adam and the modified Adam. The stepsize in the original Adam is set to $\frac{\eta}{\sqrt{\boldsymbol v_t}+\lambda}$ and our stepsize is set to $\frac{\eta}{\sqrt{\boldsymbol v_t+\zeta}}$. We follow the setting in \cite{yoshioka2024visiontransformers} to build a vision transformer for CIFAR-10. The hyperparameters are set to $\eta=0.001$, $\beta_1=0.9$, $\beta_2=0.999$.
  • Figure 3: Coordinate-wise smoothness vs. absolute gradient value on an LSTM language model for the PTB dataset. Each figure presents one randomly selected coordinate.
  • Figure 4: Coordinate-wise gradient standard deviation vs. absolute gradient value on an LSTM language model for the PTB dataset. Each figure presents one randomly selected coordinate.
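The two stepsize forms named in the captions of Figures 1 and 2 can be compared numerically with a short Python sketch. The values of `lam` and `zeta` below are illustrative assumptions chosen so that $\lambda=\sqrt{\zeta}$; they are not the settings used in the paper's experiments, and the function names are hypothetical.

```python
import numpy as np

def original_stepsize(v, eta=1e-3, lam=1e-4):
    """Original Adam form from the captions: eta / (sqrt(v) + lambda)."""
    return eta / (np.sqrt(v) + lam)

def modified_stepsize(v, eta=1e-3, zeta=1e-8):
    """Modified form from the captions: eta / sqrt(v + zeta)."""
    return eta / np.sqrt(v + zeta)

# Second-moment values spanning several orders of magnitude (illustrative).
v = np.array([0.0, 1e-8, 1e-4, 1.0])

orig = original_stepsize(v)   # assumed lam = 1e-4
mod = modified_stepsize(v)    # assumed zeta = lam**2 = 1e-8

for vi, o, m in zip(v, orig, mod):
    print(f"v={vi:.0e}  original={o:.6g}  modified={m:.6g}")
```

Because $\sqrt{v+\zeta}\le\sqrt{v}+\sqrt{\zeta}$, choosing $\lambda=\sqrt{\zeta}$ makes the modified stepsize at least as large as the original one for every $v\ge 0$, and both are capped at $\eta/\sqrt{\zeta}$ as $v\to 0$.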

Theorems & Definitions (23)

  • Theorem 1: Informal
  • Proof: Proof sketch
  • Remark 1: Importance of modified adaptive stepsize $\boldsymbol \eta$
  • Lemma 1: Informal
  • Lemma 2
  • Lemma 3
  • Theorem 2: Informal
  • Proof: Proof sketch
  • Lemma 4
  • Lemma 5
  • ...and 13 more