On the Convergence of A Class of Adam-Type Algorithms for Non-Convex Optimization

Xiangyi Chen, Sijia Liu, Ruoyu Sun, Mingyi Hong

TL;DR

The paper tackles the unresolved question of convergence for Adam-type adaptive gradient methods in non-convex optimization. It introduces a unified generalized Adam framework that includes Adam, AMSGrad, AdaGrad, and AdaFom, and derives mild, practical conditions on step sizes and momentum that guarantee convergence to first-order stationary points with a rate of $O(\log T/\sqrt{T})$. The authors also demonstrate the necessity and tightness of these conditions through targeted examples and extend the analysis to deterministic incremental variants, showing broad applicability. Empirical results on MNIST and CIFAR-10 corroborate the theoretical findings, illustrating the practical behavior of AMSGrad, Adam, AdaFom, and AdaGrad. Overall, the work clarifies when Adam-type methods converge in non-convex settings and provides guidance for algorithm design and hyperparameter tuning in practice.

Abstract

This paper studies a class of adaptive gradient based momentum algorithms that update the search directions and learning rates simultaneously using past gradients. This class, which we refer to as the "Adam-type", includes the popular algorithms such as the Adam, AMSGrad and AdaGrad. Despite their popularity in training deep neural networks, the convergence of these algorithms for solving nonconvex problems remains an open question. This paper provides a set of mild sufficient conditions that guarantee the convergence for the Adam-type methods. We prove that under our derived conditions, these methods can achieve the convergence rate of order $O(\log{T}/\sqrt{T})$ for nonconvex stochastic optimization. We show the conditions are essential in the sense that violating them may make the algorithm diverge. Moreover, we propose and analyze a class of (deterministic) incremental adaptive gradient algorithms, which has the same $O(\log{T}/\sqrt{T})$ convergence rate. Our study could also be extended to a broader class of adaptive gradient methods in machine learning and optimization.
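The class described above can be illustrated with a minimal sketch, which is not the authors' code. It assumes the commonly cited second-moment rules for each variant: an exponential moving average for Adam, a running maximum of that average for AMSGrad, and a running mean of squared gradients for AdaGrad and AdaFom, with AdaGrad using no first-order momentum. The function name `adam_type_run`, the step-size schedule $\alpha_t = \alpha/\sqrt{t}$, the default hyperparameters, and the small `eps` added to the denominator are all assumptions made for illustration; bias correction is omitted.

```python
import math

VARIANTS = ("adam", "amsgrad", "adagrad", "adafom")

def adam_type_run(grad, x0, T, variant="amsgrad",
                  alpha=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run T steps of an Adam-type update on a list of floats (hypothetical helper)."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant}")
    x = list(x0)
    d = len(x)
    m = [0.0] * d      # first-moment (momentum) estimate m_t
    v = [0.0] * d      # exponential moving average of squared gradients
    v_hat = [0.0] * d  # second-moment term actually used to scale the step
    # AdaGrad is treated here as the momentum-free case (beta1 = 0).
    b1 = 0.0 if variant == "adagrad" else beta1
    for t in range(1, T + 1):
        g = grad(x)
        alpha_t = alpha / math.sqrt(t)  # assumed diminishing step size
        for i in range(d):
            m[i] = b1 * m[i] + (1.0 - b1) * g[i]
            if variant in ("adam", "amsgrad"):
                v[i] = beta2 * v[i] + (1.0 - beta2) * g[i] ** 2
                # AMSGrad keeps the running maximum; Adam uses v directly.
                v_hat[i] = max(v_hat[i], v[i]) if variant == "amsgrad" else v[i]
            else:
                # AdaGrad / AdaFom: running mean of squared gradients.
                v_hat[i] += (g[i] ** 2 - v_hat[i]) / t
            # eps is a numerical safeguard added for this sketch only.
            x[i] -= alpha_t * m[i] / (math.sqrt(v_hat[i]) + eps)
    return x

if __name__ == "__main__":
    # Toy problem f(x) = x^2 with gradient 2x, starting from x = 1.
    quadratic_grad = lambda x: [2.0 * xi for xi in x]
    for name in VARIANTS:
        print(name, adam_type_run(quadratic_grad, [1.0], T=2000, variant=name))
```

The only difference between the four variants is how `v_hat` is formed and whether momentum is used, which is why a single framework can cover all of them. The toy quadratic is convex and deterministic, so it only exercises the update rules and says nothing about the non-convex stochastic setting the paper analyzes.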

Paper Structure

This paper contains 21 sections, 12 theorems, 77 equations, 8 figures, 1 table.

Key Result

Theorem 3.1

Suppose that Assumptions A1-A3 are satisfied, $\beta_1$ is chosen such that $\beta_1 \geq \beta_{1,t}$, $\beta_{1,t} \in [0,1)$ is non-increasing, and for some constant $G>0$, $\left\|\alpha_t m_t/\sqrt{\hat{v}_t}\right\| \leq G, \; \forall~t.$ Then Algorithm 1 yields where $C_1, C_2, C_3$ are constants independent of $d$ and $T$, $C_4$ is a constant independent of $T$, the expectation

Figures (8)

  • Figure 1: A toy example to illustrate effect of Term A on Adam, AMSGrad, and SGD.
  • Figure 2: A toy example to illustrate effect of Term B on Adam and AMSGrad.
  • Figure 3: Comparison of AMSGrad, Adam, AdaFom and AdaGrad under MNIST in training loss and testing accuracy.
  • Figure 4: Comparison of AMSGrad, Adam, AdaFom and AdaGrad under CIFAR in training loss and testing accuracy.
  • Figure A1: Comparison of algorithms with $\alpha_t = 0.1$, we defined $\alpha_0=0$
  • ...and 3 more figures

Theorems & Definitions (12)

  • Theorem 3.1
  • Corollary 3.1
  • Corollary 3.2
  • Lemma 6.1
  • Lemma 6.2
  • Lemma 6.3
  • Lemma 6.4
  • Lemma 6.5
  • Lemma 6.6
  • Lemma 6.7
  • ...and 2 more