On the $O(\frac{\sqrt{d}}{T^{1/4}})$ Convergence Rate of RMSProp and Its Momentum Extension Measured by $\ell_1$ Norm

Huan Li; Yiming Dong; Zhouchen Lin

TL;DR

The paper tackles nonconvex stochastic optimization and analyzes RMSProp and its momentum extension, establishing a convergence rate in the $\ell_1$ norm with explicit dimension dependence. By leveraging coordinate-wise bounded noise variance and a heavy-ball reformulation, it derives a rate of the form $\frac{1}{T}\sum_{k=1}^T E\left[\|\nabla f(\boldsymbol{x}^k)\|_1\right] = \tilde{O}\bigl(\frac{\sqrt{d}}{T^{1/4}}\sqrt[4]{\boldsymbol{\sigma}^2 L (f(\boldsymbol{x}^1)-f^*)} + \frac{\sqrt{d}}{\sqrt{T}}\sqrt{L(f(\boldsymbol{x}^1)-f^*)}\bigr)$, whose leading term scales as $\frac{\sqrt{d}}{T^{1/4}}$ times problem constants; this matches the SGD lower bound up to logarithmic factors in all coefficients except $d$, and is aligned with the ideal $\ell_1$ to $\ell_2$ correspondence when $\|\nabla f(\boldsymbol{x})\|_1=\tilde{\Theta}(\sqrt{d})\|\nabla f(\boldsymbol{x})\|_2$. The approach combines a heavy-ball style analysis, careful control of the adaptive steps, and coordinate-wise noise bounds to achieve tight $L$- and $\boldsymbol{\sigma}$-dependent guarantees, plus empirical evidence that gradient norms behave as $\|\nabla f(\boldsymbol{x}^k)\|_1=\tilde{\Theta}(\sqrt{d})\|\nabla f(\boldsymbol{x}^k)\|_2$ in large-scale networks, justifying the $\ell_1$ formulation. The work situates its results among the AdaGrad, RMSProp, and Adam literature and demonstrates improved dimension dependence relative to prior adaptive-method analyses, offering practical insight for high-dimensional deep learning optimization. The supporting experiments train a CNN (ResNet50) and a language model (GPT2).

Abstract

Although adaptive gradient methods have been extensively used in deep learning, the convergence rates proved for them in the literature are all slower than that of SGD, particularly with respect to their dependence on the dimension. This paper considers the classical RMSProp and its momentum extension and establishes the convergence rate of $\frac{1}{T}\sum_{k=1}^T E\left[\|\nabla f(x^k)\|_1\right]\leq O(\frac{\sqrt{d}C}{T^{1/4}})$ measured by the $\ell_1$ norm without the bounded gradient assumption, where $d$ is the dimension of the optimization variable, $T$ is the iteration number, and $C$ is a constant identical to the one that appears in the optimal convergence rate of SGD. Our convergence rate matches the lower bound with respect to all the coefficients except the dimension $d$. Since $\|x\|_2\ll\|x\|_1\leq\sqrt{d}\|x\|_2$ for problems with extremely large $d$, our convergence rate can be considered to be analogous to the $\frac{1}{T}\sum_{k=1}^T E\left[\|\nabla f(x^k)\|_2\right]\leq O(\frac{C}{T^{1/4}})$ rate of SGD in the ideal case of $\|\nabla f(x)\|_1=\varTheta(\sqrt{d}\|\nabla f(x)\|_2)$.
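
The norm comparison behind this analogy can be spelled out in two lines. The following is a short derivation added for illustration; it is not quoted from the paper, and the constant $c>0$ is an assumed lower bound on the norm ratio along the trajectory. By the Cauchy-Schwarz inequality and by expanding the square of a sum of non-negative terms,

$$\|x\|_1=\sum_{i=1}^d |x_i|\cdot 1\leq\Bigl(\sum_{i=1}^d x_i^2\Bigr)^{1/2}\Bigl(\sum_{i=1}^d 1\Bigr)^{1/2}=\sqrt{d}\,\|x\|_2,\qquad \|x\|_2^2=\sum_{i=1}^d x_i^2\leq\Bigl(\sum_{i=1}^d |x_i|\Bigr)^2=\|x\|_1^2.$$

If, in addition, $\|\nabla f(x^k)\|_1\geq c\sqrt{d}\,\|\nabla f(x^k)\|_2$ holds for all $k$, then dividing the $\ell_1$ bound by $c\sqrt{d}$ gives

$$\frac{1}{T}\sum_{k=1}^T E\left[\|\nabla f(x^k)\|_2\right]\leq\frac{1}{c\sqrt{d}}\cdot\frac{1}{T}\sum_{k=1}^T E\left[\|\nabla f(x^k)\|_1\right]\leq O\Bigl(\frac{C}{c\,T^{1/4}}\Bigr),$$

which has the same form as the SGD rate, with no remaining dependence on $d$.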

Paper Structure (16 sections, 8 theorems, 70 equations, 1 figure, 2 algorithms)

Introduction
Contribution
Notations and Assumptions
Convergence Rates of RMSProp and Its Momentum Extension
Literature Comparisons
Convergence Rate of AdaGrad in hong-2024-adagrad
Convergence Rate of AdaGrad in Liu-2023-icml
Convergence Rate of RMSProp in luo-2020-iclr
Convergence Rate of RMSProp in bottou-2022-tmlr
Convergence Rate of Adam in haochuanli-2023
Other works
Proof of Theorem 1
Supporting Lemmas
Experimental Details
Proof of Lemma 1
...and 1 more sections

Key Result

Theorem 1

Suppose that Assumptions 1-3 hold. Let $\eta=\frac{\gamma}{\sqrt{dT}}$, $\beta=1-\frac{1}{T}$, $\mathbf{v}_i^0=\lambda\max\left\{\sigma_i^2,\frac{1}{dT}\right\},\forall i$, and $T\geq \frac{e^2}{\lambda}$, where $\theta\in[0,1)$, $\lambda\leq 1$, and $\gamma$ can be any constants serving as hyper-parameters …
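
The hyper-parameter choices in Theorem 1 can be tried out numerically. The following is a minimal sketch, not the paper's implementation: this excerpt does not reproduce the paper's algorithm boxes, so the update rule (a common form of RMSProp with heavy-ball momentum), the toy objective $f(\mathbf{x})=\frac{1}{2}\|\mathbf{x}\|_2^2$, the Gaussian gradient noise, and the function name `rmsprop_momentum` are all assumptions made for illustration. Only $\eta$, $\beta$, and $\mathbf{v}^0$ follow the theorem statement.

```python
import numpy as np

def rmsprop_momentum(grad_fn, x0, T, sigma, gamma=1.0, lam=1.0, theta=0.9, seed=0):
    """RMSProp with heavy-ball momentum (assumed form), using the
    hyper-parameters stated in Theorem 1. `sigma` holds the assumed
    per-coordinate noise standard deviations."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    d = x.size
    eta = gamma / np.sqrt(d * T)                      # eta = gamma / sqrt(dT)
    beta = 1.0 - 1.0 / T                              # beta = 1 - 1/T
    v = lam * np.maximum(sigma**2, 1.0 / (d * T))     # v_i^0 = lam * max(sigma_i^2, 1/(dT))
    m = np.zeros(d)
    l1_norms = []
    for _ in range(T):
        l1_norms.append(np.abs(grad_fn(x)).sum())     # ||grad f(x^k)||_1 (noise-free)
        g = grad_fn(x) + sigma * rng.standard_normal(d)  # stochastic gradient
        v = beta * v + (1.0 - beta) * g**2            # second-moment estimate
        m = theta * m + (1.0 - theta) * g / np.sqrt(v)   # momentum on the scaled gradient
        x = x - eta * m
    return x, np.array(l1_norms)

# Toy problem: f(x) = 0.5 * ||x||^2, so grad f(x) = x.
d, T = 10, 2000
sigma = np.full(d, 0.1)
x_final, l1_norms = rmsprop_momentum(lambda x: x, np.ones(d), T, sigma)
print("initial l1 gradient norm:", l1_norms[0])
print("average l1 gradient norm:", l1_norms.mean())
```

The quantity `l1_norms.mean()` is the empirical counterpart of $\frac{1}{T}\sum_{k=1}^T\|\nabla f(\mathbf{x}^k)\|_1$ for a single run; a single toy run illustrates the setup but says nothing about the rate itself.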

Figures (1)

Figure 1: Illustration of the relationship $\|\nabla f(\mathbf{x}^k)\|_1=\varTheta(\sqrt{d})\|\nabla f(\mathbf{x}^k)\|_2$. We use RMSProp and RMSProp with momentum to train ResNet50 on CIFAR-100 and ImageNet, and train GPT2 on the OpenWebText dataset. The gradient norm ratio shows $\frac{\|\nabla f(\mathbf{x}^k)\|_1}{\|\nabla f(\mathbf{x}^k)\|_2}$ and the average training loss shows the average loss over training samples.
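
The gradient norm ratio plotted in Figure 1 is straightforward to compute for any vector. The sketch below uses synthetic vectors rather than the network gradients from the paper's experiments, and the helper name `l1_l2_ratio` is hypothetical. It shows the two extremes of the inequality $\|x\|_2\leq\|x\|_1\leq\sqrt{d}\|x\|_2$ and a dense Gaussian vector, for which the ratio is close to $\sqrt{2d/\pi}$ because $E|z|=\sqrt{2/\pi}$ for a standard normal $z$.

```python
import numpy as np

def l1_l2_ratio(g):
    """Return ||g||_1 / ||g||_2 for a nonzero vector g."""
    g = np.asarray(g, dtype=float).ravel()
    return np.abs(g).sum() / np.linalg.norm(g)

d = 10_000
rng = np.random.default_rng(0)

one_hot = np.zeros(d)
one_hot[0] = 1.0

ratio_sparse = l1_l2_ratio(one_hot)                 # lower extreme: ratio is 1
ratio_ones = l1_l2_ratio(np.ones(d))                # upper extreme: ratio is sqrt(d)
ratio_dense = l1_l2_ratio(rng.standard_normal(d))   # dense case: order sqrt(d)

print("one-hot:", ratio_sparse)
print("all-ones:", ratio_ones, "sqrt(d):", np.sqrt(d))
print("gaussian:", ratio_dense, "sqrt(2d/pi):", np.sqrt(2 * d / np.pi))
```

A ratio that stays at order $\sqrt{d}$ during training, as in the dense case here, is the regime in which the $\ell_1$ rate is comparable to the $\ell_2$ rate of SGD.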

Theorems & Definitions (15)

Theorem 1
Corollary 1
Proof 1
Lemma 1
Lemma 2
Proof 2
Lemma 3
Proof 3
Lemma 4
Proof 4
...and 5 more
