Convergence Guarantees for RMSProp and Adam in Generalized-smooth Non-convex Optimization with Affine Noise Variance

Qi Zhang; Yi Zhou; Shaofeng Zou

Abstract

This paper provides the first tight convergence analyses for RMSProp and Adam in non-convex optimization under the most relaxed assumptions of coordinate-wise generalized smoothness and affine noise variance. We first analyze RMSProp, which is a special case of Adam with adaptive learning rates but without first-order momentum. Specifically, to address the challenges arising from the dependence among the adaptive update, the unbounded gradient estimate, and the Lipschitz constant, we demonstrate that the first-order term in the descent lemma converges and that its denominator is upper bounded by a function of the gradient norm. Based on this result, we show that RMSProp with proper hyperparameters converges to an $\epsilon$-stationary point with an iteration complexity of $\mathcal O(\epsilon^{-4})$. We then generalize our analysis to Adam, where the additional challenge is a mismatch between the gradient and the first-order momentum. We develop a new upper bound on the first-order term in the descent lemma, which is also a function of the gradient norm. We show that Adam with proper hyperparameters converges to an $\epsilon$-stationary point with an iteration complexity of $\mathcal O(\epsilon^{-4})$. Our complexity results for both RMSProp and Adam match the complexity lower bound established in \cite{arjevani2023lower}.
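The relationship between the two optimizers described in the abstract can be illustrated with a minimal sketch. It assumes the standard exponential-moving-average updates for the first and second moments, the modified stepsize $\frac{\eta}{\sqrt{\boldsymbol v_t+\zeta}}$ quoted in the figure captions, and no bias correction; the function name `adam_step` and the default values are illustrative and are not taken from the paper.

```python
import math

def adam_step(x, m, v, grad, eta=1e-3, beta1=0.9, beta2=0.999, zeta=1e-8):
    """One coordinate-wise Adam step on lists of floats (illustrative sketch).

    Assumptions: standard moving averages for m and v, stepsize
    eta / sqrt(v + zeta), and no bias correction.
    Setting beta1 = 0 removes the first-order momentum, which gives RMSProp.
    """
    # First-order momentum: moving average of the gradient.
    new_m = [beta1 * mi + (1.0 - beta1) * gi for mi, gi in zip(m, grad)]
    # Second-order momentum: moving average of the squared gradient.
    new_v = [beta2 * vi + (1.0 - beta2) * gi * gi for vi, gi in zip(v, grad)]
    # Coordinate-wise adaptive update.
    new_x = [xi - eta * mi / math.sqrt(vi + zeta)
             for xi, mi, vi in zip(x, new_m, new_v)]
    return new_x, new_m, new_v

# RMSProp as the special case beta1 = 0: the update direction is the raw gradient.
x1, m1, v1 = adam_step([1.0], [0.0], [0.0], [2.0],
                       eta=0.1, beta1=0.0, beta2=0.5, zeta=0.0)
```

With `beta1=0.0` the momentum `m1` equals the current gradient, so the mismatch between gradient and momentum that the abstract names as Adam's additional challenge does not arise.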

Paper Structure (21 sections, 13 theorems, 109 equations, 4 figures, 1 table, 1 algorithm)

Introduction
Related work
Relaxed Assumptions
Adaptive Optimizers
Preliminaries
Technical Assumptions
Challenges and Insights
Convergence Analysis of RMSProp
Convergence Analysis of Adam
Comparison with Existing Works
Conclusion
Acknowledgment
Formal Version of Lemma \ref{lemma:1} and Its Proof
Proof of Lemma \ref{co:1}
Proof of Lemma \ref{lemma:2}
...and 6 more sections

Key Result

Theorem 1

Let Assumptions \ref{assump:lowerbound}, \ref{assump:variance} and \ref{assump:generalsmooth} hold. Let $1-\beta_2= \mathcal{O}(\epsilon^{2})$, $\eta = \mathcal{O}(\epsilon^{2})$, and $T= \mathcal{O} (\epsilon^{-4})$. For $\epsilon$ such that $\epsilon\le \frac{\sqrt{5dD_0}}{\sqrt{D_1}\sqrt[4]{\zeta}}$, we have that
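The hyperparameter scaling in this statement can be sketched as a small helper. The theorem gives only orders of magnitude, so the constants `c_beta`, `c_eta` and `c_T` below are hypothetical placeholders, and the function names are illustrative. The symbols $d$, $D_0$, $D_1$ and $\zeta$ are used exactly as they appear in the bound on $\epsilon$; their definitions are not reproduced on this page.

```python
import math

def max_epsilon(d, D0, D1, zeta):
    """Largest epsilon allowed by the stated condition:
    epsilon <= sqrt(5 * d * D0) / (sqrt(D1) * zeta ** (1/4))."""
    return math.sqrt(5.0 * d * D0) / (math.sqrt(D1) * zeta ** 0.25)

def rmsprop_schedule(eps, c_beta=1.0, c_eta=1.0, c_T=1.0):
    """Hyperparameters with the stated scaling in eps.

    c_beta, c_eta and c_T are hypothetical constants standing in for
    the unspecified factors hidden by the O(.) notation.
    """
    beta2 = 1.0 - c_beta * eps ** 2   # 1 - beta2 = O(eps^2)
    eta = c_eta * eps ** 2            # eta = O(eps^2)
    T = math.ceil(c_T * eps ** -4)    # T = O(eps^-4)
    return beta2, eta, T

# Example with made-up problem constants.
eps_cap = max_epsilon(d=5, D0=1.0, D1=1.0, zeta=1.0)
beta2, eta, T = rmsprop_schedule(0.5)
```

Halving `eps` multiplies the iteration budget `T` by sixteen, which is the practical meaning of the $\mathcal{O}(\epsilon^{-4})$ complexity.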

Figures (4)

Figure 1: Test error for original Adam and modified Adam. The stepsize in the original Adam is set to $\frac{\eta}{\sqrt{\boldsymbol v_t}+\lambda}$ and our stepsize is set to $\frac{\eta}{\sqrt{\boldsymbol v_t+\zeta}}$. The parameters are the same as in the CNN task in Fig. 1 of \cite{li2023convergence}, where $\eta=0.001$, $\beta_1=0.9$, $\beta_2=0.999$, and we build a six-layer CNN for CIFAR-10.
Figure 2: Test accuracy for original Adam and modified Adam. The stepsize in the original Adam is set to $\frac{\eta}{\sqrt{\boldsymbol v_t}+\lambda}$ and our stepsize is set to $\frac{\eta}{\sqrt{\boldsymbol v_t+\zeta}}$. We follow the setting in \cite{yoshioka2024visiontransformers} to build a vision transformer for CIFAR-10. The hyperparameters are set to $\eta=0.001$, $\beta_1=0.9$, $\beta_2=0.999$.
Figure 3: Coordinate-wise smoothness vs. absolute gradient value on an LSTM language model for the PTB dataset. Each panel presents one randomly selected coordinate.
Figure 4: Coordinate-wise gradient standard deviation vs. absolute gradient value on an LSTM language model for the PTB dataset. Each panel presents one randomly selected coordinate.
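The two stepsize rules compared in Figures 1 and 2 differ only in where the stabilizing constant enters. The following sketch evaluates both for a scalar second-moment value `v`; the function names and default constants are illustrative assumptions, not values from the paper.

```python
import math

def original_stepsize(v, eta=1e-3, lam=1e-8):
    """Original Adam: constant added outside the square root."""
    return eta / (math.sqrt(v) + lam)

def modified_stepsize(v, eta=1e-3, zeta=1e-8):
    """Modified Adam: constant added inside the square root."""
    return eta / math.sqrt(v + zeta)

# With lam = sqrt(zeta) the two rules agree at v = 0.
zeta = 1e-8
lam = math.sqrt(zeta)
at_zero = (original_stepsize(0.0, lam=lam), modified_stepsize(0.0, zeta=zeta))

# For v > 0, sqrt(v + zeta) <= sqrt(v) + sqrt(zeta), so with lam = sqrt(zeta)
# the modified stepsize is never smaller than the original one.
at_one = (original_stepsize(1.0, lam=lam), modified_stepsize(1.0, zeta=zeta))
```

Under the matching choice `lam = sqrt(zeta)`, the two rules coincide when the second moment vanishes and stay close when it is large relative to `zeta`, so any difference in behavior comes from the intermediate regime.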

Theorems & Definitions (23)

Theorem 1: Informal
Proof: Proof sketch
Remark 1: Importance of modified adaptive stepsize $\boldsymbol \eta$
Lemma 1: Informal
Lemma 2
Lemma 3
Theorem 2: Informal
Proof: Proof sketch
Lemma 4
Lemma 5
...and 13 more
