On the $O(\frac{\sqrt{d}}{T^{1/4}})$ Convergence Rate of RMSProp and Its Momentum Extension Measured by $\ell_1$ Norm

Huan Li, Yiming Dong, Zhouchen Lin

TL;DR

The paper tackles nonconvex stochastic optimization and analyzes RMSProp and its momentum extension, establishing a convergence rate in the $\ell_1$ norm with explicit dimension dependence. By leveraging coordinate-wise bounded noise variance and a heavy-ball reformulation, it derives a rate of the form $\frac{1}{T}\sum_{k=1}^T E[\|\nabla f(\boldsymbol{x}^k)\|_1] = \tilde{O}\bigl(\frac{\sqrt{d}}{T^{1/4}}\sqrt[4]{\boldsymbol{\sigma}^2 L(f(\boldsymbol{x}^1)-f^*)} + \frac{\sqrt{d}}{\sqrt{T}}\sqrt{L(f(\boldsymbol{x}^1)-f^*)}\bigr)$, whose leading term scales as $\frac{\sqrt{d}}{T^{1/4}}$ times problem constants; this matches the SGD lower bound up to logarithmic factors in all coefficients except $d$, and is aligned with the ideal $\ell_1$-to-$\ell_2$ correspondence when $\|\nabla f(\boldsymbol{x})\|_1=\tilde{\Theta}(\sqrt{d})\|\nabla f(\boldsymbol{x})\|_2$. The approach combines a heavy-ball style analysis, careful control of the adaptive steps, and coordinate-wise noise bounds to achieve tight $L$- and $\boldsymbol{\sigma}$-dependent guarantees, plus empirical evidence that gradient norms behave as $\|\nabla f(\boldsymbol{x}^k)\|_1=\tilde{\Theta}(\sqrt{d})\|\nabla f(\boldsymbol{x}^k)\|_2$ in large-scale networks, justifying the $\ell_1$ formulation. The work situates its results among the AdaGrad, RMSProp, and Adam literature and demonstrates improved dimension dependence relative to prior adaptive-method analyses, offering practical insight for high-dimensional deep learning optimization. It also provides experimental validation on CNNs and language models, confirming the theoretical and practical relevance of the proposed rates.

Abstract

Although adaptive gradient methods have been extensively used in deep learning, their convergence rates proved in the literature are all slower than that of SGD, particularly with respect to their dependence on the dimension. This paper considers the classical RMSProp and its momentum extension and establishes the convergence rate of $\frac{1}{T}\sum_{k=1}^T E\left[\|\nabla f(x^k)\|_1\right]\leq O(\frac{\sqrt{d}C}{T^{1/4}})$ measured by the $\ell_1$ norm without the bounded gradient assumption, where $d$ is the dimension of the optimization variable, $T$ is the iteration number, and $C$ is a constant identical to the one appearing in the optimal convergence rate of SGD. Our convergence rate matches the lower bound with respect to all the coefficients except the dimension $d$. Since $\|x\|_2\ll\|x\|_1\leq\sqrt{d}\|x\|_2$ for problems with extremely large $d$, our convergence rate can be considered to be analogous to the $\frac{1}{T}\sum_{k=1}^T E\left[\|\nabla f(x^k)\|_2\right]\leq O(\frac{C}{T^{1/4}})$ rate of SGD in the ideal case of $\|\nabla f(x)\|_1=\varTheta(\sqrt{d}\|\nabla f(x)\|_2)$.
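
The norm relation quoted in the abstract, and the step from the $\ell_1$ rate to the SGD-style $\ell_2$ rate, can be spelled out in two lines. This is a brief sketch in the abstract's notation; the constant $c>0$ is an assumption introduced here to make the "ideal case" $\|\nabla f(x)\|_1=\varTheta(\sqrt{d}\|\nabla f(x)\|_2)$ concrete. The upper bound is the Cauchy-Schwarz inequality, with equality when all $|x_i|$ are equal:

$$
\|x\|_1=\sum_{i=1}^d |x_i|\cdot 1\leq\Big(\sum_{i=1}^d x_i^2\Big)^{1/2}\Big(\sum_{i=1}^d 1\Big)^{1/2}=\sqrt{d}\,\|x\|_2 .
$$

If, in addition, $\|\nabla f(x^k)\|_1\geq c\sqrt{d}\,\|\nabla f(x^k)\|_2$ holds along the iterates, dividing the $\ell_1$ bound by $c\sqrt{d}$ gives

$$
\frac{1}{T}\sum_{k=1}^T E\left[\|\nabla f(x^k)\|_2\right]\leq\frac{1}{c\sqrt{d}}\cdot\frac{1}{T}\sum_{k=1}^T E\left[\|\nabla f(x^k)\|_1\right]\leq O\Big(\frac{C}{T^{1/4}}\Big),
$$

so the factor $\sqrt{d}$ in the $\ell_1$ rate cancels exactly in that regime.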

Paper Structure

This paper contains 16 sections, 8 theorems, 70 equations, 1 figure, 2 algorithms.

Key Result

Theorem 1

Suppose that Assumptions 1-3 hold. Let $\eta=\frac{\gamma}{\sqrt{dT}}$, $\beta=1-\frac{1}{T}$, $\mathbf{v}_i^0=\lambda\max\left\{\sigma_i^2,\frac{1}{dT}\right\},\forall i$, and $T\geq \frac{e^2}{\lambda}$, where $\theta\in[0,1)$, $\lambda\leq 1$, and $\gamma$ can be any constants serving as hyper-parameters ...
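
To make the hyper-parameter choices in Theorem 1 concrete, the following is a minimal NumPy sketch. The update rule is an assumption: it uses a common form of RMSProp with heavy-ball momentum, since the paper's algorithm listing is not reproduced on this page. The function name `rmsprop_momentum`, the toy quadratic objective, and the noise model are all hypothetical and chosen only for illustration; only $\eta$, $\beta$, $\mathbf{v}^0$, and the condition on $T$ are taken from the theorem statement.

```python
import numpy as np


def rmsprop_momentum(grad_fn, x0, T, sigma, gamma=1.0, lam=1.0, theta=0.9):
    """RMSProp with momentum, using the hyper-parameters stated in Theorem 1.

    The update rule below is an assumed, commonly used form; it is not
    copied from the paper's algorithm listing.
    """
    if T < np.e ** 2 / lam:
        raise ValueError("Theorem 1 requires T >= e^2 / lambda")
    x = np.asarray(x0, dtype=np.float64).copy()
    d = x.size
    eta = gamma / np.sqrt(d * T)                    # eta = gamma / sqrt(dT)
    beta = 1.0 - 1.0 / T                            # beta = 1 - 1/T
    v = lam * np.maximum(sigma ** 2, 1.0 / (d * T)) # v_i^0 = lam * max{sigma_i^2, 1/(dT)}
    m = np.zeros(d)
    for _ in range(T):
        g = grad_fn(x)                              # stochastic gradient
        v = beta * v + (1.0 - beta) * g ** 2        # second-moment estimate
        m = theta * m + (1.0 - theta) * g / np.sqrt(v)  # heavy-ball momentum
        x = x - eta * m
    return x


# Toy problem (assumption): f(x) = 0.5 * ||x||^2 with coordinate-wise noise.
rng = np.random.default_rng(0)
d, T = 10, 2000
sigma = np.full(d, 0.1)


def noisy_grad(x):
    return x + sigma * rng.standard_normal(d)


x0 = np.ones(d)
x_final = rmsprop_momentum(noisy_grad, x0, T, sigma)
# For this objective the true gradient equals x, so these are l1 gradient norms.
print(np.linalg.norm(x0, 1), np.linalg.norm(x_final, 1))
```

Note that $\mathbf{v}^0$ depends on the noise levels $\sigma_i$, so the sketch passes them in explicitly; in practice they would have to be estimated.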

Figures (1)

  • Figure 1: Illustration of the relationship $\|\nabla f(\mathbf{x}^k)\|_1=\varTheta(\sqrt{d})\|\nabla f(\mathbf{x}^k)\|_2$. We use RMSProp and RMSProp with momentum to train ResNet50 on CIFAR-100 and ImageNet, and train GPT2 on the OpenWebText dataset. The gradient norm ratio shows $\frac{\|\nabla f(\mathbf{x}^k)\|_1}{\|\nabla f(\mathbf{x}^k)\|_2}$ and the average training loss shows the average loss over training samples.
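
The gradient norm ratio plotted in Figure 1 is straightforward to compute. The sketch below uses synthetic vectors rather than real network gradients, so it only illustrates the two extremes of the ratio; the helper name `l1_l2_ratio` is hypothetical. A dense Gaussian vector has a ratio close to $\sqrt{2d/\pi}$, which is of order $\sqrt{d}$, while a one-hot vector has a ratio of exactly 1.

```python
import numpy as np


def l1_l2_ratio(g):
    """Return ||g||_1 / ||g||_2 for a flattened gradient (0.0 if g is zero)."""
    g = np.asarray(g, dtype=np.float64).ravel()
    l2 = np.linalg.norm(g, 2)
    if l2 == 0.0:
        return 0.0
    return np.linalg.norm(g, 1) / l2


rng = np.random.default_rng(0)
d = 10_000

dense = rng.standard_normal(d)   # all coordinates of comparable size
sparse = np.zeros(d)
sparse[0] = 1.0                  # a single nonzero coordinate

# Normalize by sqrt(d): values near a constant indicate the Theta(sqrt(d)) regime.
dense_ratio = l1_l2_ratio(dense) / np.sqrt(d)
sparse_ratio = l1_l2_ratio(sparse) / np.sqrt(d)
print(dense_ratio, sparse_ratio)
```

For a real model, the same function could be applied to the concatenation of all parameter gradients at each logged iteration.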

Theorems & Definitions (15)

  • Theorem 1
  • Corollary 1
  • Proof 1
  • Lemma 1
  • Lemma 2
  • Proof 2
  • Lemma 3
  • Proof 3
  • Lemma 4
  • Proof 4
  • ...and 5 more