Adam with model exponential moving average is effective for nonconvex optimization

Kwangjun Ahn; Ashok Cutkosky

TL;DR

This work demonstrates that a clipped version of Adam with model EMA achieves the optimal convergence rates in various nonconvex optimization settings, both smooth and nonsmooth.

Abstract

In this work, we offer a theoretical analysis of two modern optimization techniques for training large and complex models: (i) adaptive optimization algorithms, such as Adam, and (ii) the model exponential moving average (EMA). Specifically, we demonstrate that a clipped version of Adam with model EMA achieves the optimal convergence rates in various nonconvex optimization settings, both smooth and nonsmooth. Moreover, when the scale varies significantly across different coordinates, we demonstrate that the coordinate-wise adaptivity of Adam is provably advantageous. Notably, unlike previous analyses of Adam, our analysis crucially relies on its core elements -- momentum and discounting factors -- as well as model EMA, motivating their wide applications in practice.

Paper Structure (20 sections, 15 theorems, 45 equations, 2 algorithms)

Introduction
Related work
Setting for nonconvex and nonsmooth optimization
Discounted-to-nonconvex conversion: online learning of increments
Scale-free Follow-the-Regularized-Leader (FTRL)
Discounted-FTRL leads to adaptive nonconvex optimization
From gradient-adaptive regret to nonconvex optimization
Optimality and gradient adaptivity
Optimality
Gradient adaptivity
Coordinate-wise adaptivity via (clipped-)Adam
Coordinate-wise discounted FTRL corresponds to (clipped-)Adam
Nonconvex optimization guarantees of (clipped-)Adam
Benefits of coordinate-wise adaptivity of (clipped-)Adam
Discussion
...and 5 more sections

Key Result

Theorem 1

Clipped Adam with EMA on its iterates achieves the optimal convergence rate for nonconvex optimization in both smooth and nonsmooth settings (sec:global). The coordinate-wise adaptivity of Adam is particularly effective when the scale varies across different coordinates (sec:coordinate).
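
To make the ingredients named in this result concrete, the following is a minimal sketch of an Adam-style update combined with clipping and an EMA of the iterates. It is not the algorithm analyzed in the paper: the function name `clipped_adam_with_ema`, the per-coordinate clipping rule, the bias correction, and all default hyperparameters are assumptions chosen for illustration only.

```python
import math


def clipped_adam_with_ema(grad_fn, x0, steps=200, lr=0.05, beta1=0.9,
                          beta2=0.999, ema_decay=0.9, clip=1.0, eps=1e-8):
    """Illustrative Adam-style loop with clipped steps and an EMA of iterates.

    Assumption: clipping is applied per coordinate to the normalized step,
    which may differ from the clipping used in the paper.
    """
    x = list(x0)        # current iterate
    x_ema = list(x0)    # exponential moving average of the iterates
    m = [0.0] * len(x)  # first-moment (momentum) estimate
    v = [0.0] * len(x)  # second-moment (discounted squared gradient) estimate
    for t in range(1, steps + 1):
        g = grad_fn(x)
        for i in range(len(x)):
            m[i] = beta1 * m[i] + (1 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2
            m_hat = m[i] / (1 - beta1 ** t)  # standard Adam bias correction
            v_hat = v[i] / (1 - beta2 ** t)
            step = m_hat / (math.sqrt(v_hat) + eps)
            step = max(-clip, min(clip, step))  # clip the normalized step
            x[i] -= lr * step
            x_ema[i] = ema_decay * x_ema[i] + (1 - ema_decay) * x[i]
    return x, x_ema


# Toy objective whose two coordinates have very different gradient scales.
def toy_grad(x):
    return [100.0 * x[0], 0.01 * x[1]]


last_iterate, averaged_iterate = clipped_adam_with_ema(toy_grad, [1.0, 1.0])
print(last_iterate, averaged_iterate)
```

Because each coordinate is normalized by its own second-moment estimate, the two coordinates of the toy objective make progress at a similar pace despite gradient scales that differ by four orders of magnitude. This is the kind of coordinate-wise adaptivity the result refers to; the returned `averaged_iterate` plays the role of the model EMA.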

Theorems & Definitions (19)

Theorem 1: Informal
Definition 3: $(\lambda,\varepsilon)$-stationary point
Lemma 4
Lemma 5
Definition 6: Discounted regret
Lemma 7: Discounted-to-nonconvex conversion
Lemma 8: Gradient-adaptive regret bound
Theorem 9: Discounted regret bound
Lemma 10: Variance bound
proof
...and 9 more
