ADOPT: Modified Adam Can Converge with Any $β_2$ with the Optimal Rate

Shohei Taniguchi, Keno Harada, Gouki Minegishi, Yuta Oshima, Seong Cheol Jeong, Go Nagahara, Tomoshi Iiyama, Masahiro Suzuki, Yusuke Iwasawa, Yutaka Matsuo

TL;DR

A new adaptive gradient method named ADOPT is proposed, which achieves the optimal convergence rate of $\mathcal{O} ( 1 / \sqrt{T} )$ with any choice of $\beta_2$ without depending on the bounded noise assumption.

Abstract

Adam is one of the most popular optimization algorithms in deep learning. However, it is known that Adam does not converge in theory unless the hyperparameter $\beta_2$ is chosen in a problem-dependent manner. There have been many attempts to fix the non-convergence (e.g., AMSGrad), but they require an impractical assumption that the gradient noise is uniformly bounded. In this paper, we propose a new adaptive gradient method named ADOPT, which achieves the optimal convergence rate of $\mathcal{O} ( 1 / \sqrt{T} )$ with any choice of $\beta_2$ without depending on the bounded noise assumption. ADOPT addresses the non-convergence issue of Adam by removing the current gradient from the second moment estimate and changing the order of the momentum update and the normalization by the second moment estimate. We also conduct extensive numerical experiments and verify that ADOPT achieves superior results compared to Adam and its variants across a wide range of tasks, including image classification, generative modeling, natural language processing, and deep reinforcement learning. The implementation is available at https://github.com/iShohei220/adopt.
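
The two changes named in the abstract can be illustrated with a small sketch. The code below is not the authors' implementation (see the repository linked above for that); it is a minimal pure-Python reading of the description, in which the current gradient is normalized by the previous second moment estimate, the normalized gradient is then fed into the momentum, and only afterwards is the second moment estimate updated. The function names, the default hyperparameter values, the use of `max(sqrt(v), eps)` for numerical safety, the initialization of `v` from the first squared gradient, and the toy objective $f(\theta) = \frac{1}{2}(\theta + 1)^2$ are all assumptions made for illustration.

```python
import math


def adopt_step(theta, grad, m, v, lr=1e-3, beta1=0.9, beta2=0.9999, eps=1e-6):
    """One ADOPT-style update on lists of floats (illustrative sketch).

    theta, grad, m, v are equal-length lists; v holds the second moment
    estimate from the PREVIOUS step, so it does not contain `grad` yet.
    """
    new_theta, new_m, new_v = [], [], []
    for th, g, m_i, v_i in zip(theta, grad, m, v):
        # Normalize by the previous second moment estimate (current gradient excluded).
        g_hat = g / max(math.sqrt(v_i), eps)  # eps handling is an assumption
        # Momentum is applied AFTER normalization (change of order relative to Adam).
        m_next = beta1 * m_i + (1.0 - beta1) * g_hat
        new_theta.append(th - lr * m_next)
        new_m.append(m_next)
        # Only now does the current gradient enter the second moment estimate.
        new_v.append(beta2 * v_i + (1.0 - beta2) * g * g)
    return new_theta, new_m, new_v


def run_toy_problem(steps=2000, lr=0.01):
    """Minimize the hypothetical objective f(theta) = 0.5 * (theta + 1)^2."""
    theta = [0.0]
    m = [0.0]
    first_grad = theta[0] + 1.0
    v = [first_grad * first_grad]  # assumption: initialize v from the first gradient
    for _ in range(steps):
        grad = [theta[0] + 1.0]  # exact gradient of the toy objective
        theta, m, v = adopt_step(theta, grad, m, v, lr=lr)
    return theta


if __name__ == "__main__":
    print(run_toy_problem())
```

On this noiseless toy problem the iterate moves toward the minimizer $\theta = -1$. The sketch shows only the ordering of the three updates; it says nothing about the stochastic setting that the convergence analysis addresses.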

Paper Structure

This paper contains 22 sections, 21 theorems, 84 equations, 7 figures, 5 tables, 2 algorithms.

Key Result

Theorem 3.1

Under the assumptions ranging from the bounded objective assumption to the smoothness assumption, together with the second moment assumption, a bound holds for RMSprop with a constant learning rate $\alpha_t = \alpha$ whose constants are $C_1 = 2 \sqrt{ G^2 + \epsilon^2 }$, $C_2 = \frac{ \alpha D L }{ 2 \left( 1 - \beta_2 \right) } + \frac{2 D G}{ \sqrt{ 1 - \beta_2 } }$, and $f_0 = f \left( {\bm{\theta}}_0 \right)$.
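
To get a feel for how the constants in Theorem 3.1 scale, the following sketch evaluates $C_1$ and $C_2$ directly from the formulas above. The function name and the sample values are hypothetical. This excerpt does not define $G$, $D$, $L$, or $\epsilon$, so the code treats them as plain positive numbers and makes no claim about their meaning.

```python
import math


def rmsprop_bound_constants(alpha, beta2, G, D, L, eps):
    """Evaluate C_1 and C_2 from Theorem 3.1 for given numeric inputs.

    C_1 = 2 * sqrt(G^2 + eps^2)
    C_2 = alpha * D * L / (2 * (1 - beta2)) + 2 * D * G / sqrt(1 - beta2)
    """
    if not 0.0 <= beta2 < 1.0:
        raise ValueError("beta2 must lie in [0, 1)")
    c1 = 2.0 * math.sqrt(G * G + eps * eps)
    c2 = alpha * D * L / (2.0 * (1.0 - beta2)) + 2.0 * D * G / math.sqrt(1.0 - beta2)
    return c1, c2


if __name__ == "__main__":
    # Hypothetical values, chosen only to show the dependence on beta2.
    for beta2 in (0.9, 0.99, 0.999):
        c1, c2 = rmsprop_bound_constants(alpha=1e-3, beta2=beta2, G=1.0, D=10, L=1.0, eps=1e-8)
        print(f"beta2={beta2}: C1={c1:.4f}, C2={c2:.4f}")
```

Both terms of $C_2$ grow as $\beta_2$ approaches 1, the first like $1 / (1 - \beta_2)$ and the second like $1 / \sqrt{1 - \beta_2}$, while $C_1$ does not depend on $\beta_2$.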

Figures (7)

  • Figure 1: Performance comparison between Adam, AMSGrad and ADOPT in a simple univariate convex optimization problem. The plots show transitions of the parameter value, which should converge to the solution $\theta = -1$.
  • Figure 2: Accuracy for training data (left) and test data (right) in MNIST classification. The error bars show the 95% confidence intervals of three trials.
  • Figure 3: Ablation study of algorithmic changes between Adam and ADOPT. "DE" and "CO" denote "decorrelation" and "change of order", respectively.
  • Figure 4: Learning curves of test accuracy for CIFAR-10 classification by ResNet-18 trained with Adam and ADOPT.
  • Figure 5: Learning curves of GPT-2 pretraining for training set (left) and validation set (right).
  • ...and 2 more figures

Theorems & Definitions (39)

  • Theorem 3.1
  • proof : Sketch of proof
  • Theorem 4.1
  • Theorem E.1
  • Theorem E.2
  • Theorem E.3
  • proof : Proof of Theorems 4.1, E.1, E.2, and E.3
  • Lemma G.1
  • proof
  • Lemma G.2
  • ...and 29 more