ADOPT: Modified Adam Can Converge with Any $β_2$ with the Optimal Rate

Shohei Taniguchi; Keno Harada; Gouki Minegishi; Yuta Oshima; Seong Cheol Jeong; Go Nagahara; Tomoshi Iiyama; Masahiro Suzuki; Yusuke Iwasawa; Yutaka Matsuo

ADOPT: Modified Adam Can Converge with Any $β_2$ with the Optimal Rate

Shohei Taniguchi, Keno Harada, Gouki Minegishi, Yuta Oshima, Seong Cheol Jeong, Go Nagahara, Tomoshi Iiyama, Masahiro Suzuki, Yusuke Iwasawa, Yutaka Matsuo

TL;DR

A new adaptive gradient method named ADOPT is proposed, which achieves the optimal convergence rate of $\mathcal{O} ( 1 / \sqrt{T} )$ with any choice of $\beta_2$ without depending on the bounded noise assumption.

Abstract

Adam is one of the most popular optimization algorithms in deep learning. However, it is known that Adam does not converge in theory unless choosing a hyperparameter, i.e., $β_2$, in a problem-dependent manner. There have been many attempts to fix the non-convergence (e.g., AMSGrad), but they require an impractical assumption that the gradient noise is uniformly bounded. In this paper, we propose a new adaptive gradient method named ADOPT, which achieves the optimal convergence rate of $\mathcal{O} ( 1 / \sqrt{T} )$ with any choice of $β_2$ without depending on the bounded noise assumption. ADOPT addresses the non-convergence issue of Adam by removing the current gradient from the second moment estimate and changing the order of the momentum update and the normalization by the second moment estimate. We also conduct intensive numerical experiments, and verify that our ADOPT achieves superior results compared to Adam and its variants across a wide range of tasks, including image classification, generative modeling, natural language processing, and deep reinforcement learning. The implementation is available at https://github.com/iShohei220/adopt.

ADOPT: Modified Adam Can Converge with Any $β_2$ with the Optimal Rate

TL;DR

A new adaptive gradient method named ADOPT is proposed, which achieves the optimal convergence rate of

with any choice of

without depending on the bounded noise assumption.

Abstract

Adam is one of the most popular optimization algorithms in deep learning. However, it is known that Adam does not converge in theory unless choosing a hyperparameter, i.e.,

, in a problem-dependent manner. There have been many attempts to fix the non-convergence (e.g., AMSGrad), but they require an impractical assumption that the gradient noise is uniformly bounded. In this paper, we propose a new adaptive gradient method named ADOPT, which achieves the optimal convergence rate of

with any choice of

without depending on the bounded noise assumption. ADOPT addresses the non-convergence issue of Adam by removing the current gradient from the second moment estimate and changing the order of the momentum update and the normalization by the second moment estimate. We also conduct intensive numerical experiments, and verify that our ADOPT achieves superior results compared to Adam and its variants across a wide range of tasks, including image classification, generative modeling, natural language processing, and deep reinforcement learning. The implementation is available at https://github.com/iShohei220/adopt.

ADOPT: Modified Adam Can Converge with Any $β_2$ with the Optimal Rate

TL;DR

Abstract

ADOPT: Modified Adam Can Converge with Any $β_2$ with the Optimal Rate

TL;DR

Abstract

Paper Structure

Table of Contents

Key Result

Figures (7)

Theorems & Definitions (39)