On Convergence of Adam for Stochastic Optimization under Relaxed Assumptions

Yusu Hong, Junhong Lin

TL;DR

This work provides a theoretical convergence analysis for vanilla Adam in non-convex stochastic optimization under relaxed noise and smoothness assumptions. By introducing a proxy step-size and a descent-decomposition framework, the authors prove that Adam converges to a stationary point with high probability at a rate of $O\left(\text{poly}(\log T)/\sqrt{T}\right)$, without tuning step-sizes to problem parameters and under a broad noise model that includes affine-variance, bounded, and sub-Gaussian cases. The analysis extends to the generalized $(L_0,L_q)$-smooth setting, yielding similar rates despite unbounded smoothness, supported by a detailed probabilistic and deterministic estimation scheme. Collectively, the results offer a rigorous justification for Adam’s robustness and adaptive behavior in practical deep learning tasks, while also highlighting avenues for future refinements and empirical validation.
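
For orientation, the noise model and the convergence rate named above are commonly formalized as follows. This is a sketch using standard forms from the literature: the stochastic gradient $g(\bm{x})$, the constants $\sigma_0,\sigma_1$, and the averaged squared-gradient criterion are assumptions made here for illustration, and the paper's own Assumption (A3) and theorem statements may use a different formulation (for example, an almost-sure or sub-Gaussian-tailed version).

```latex
% Affine-variance noise (common form): the noise level may grow with the gradient norm.
% Setting \sigma_1 = 0 recovers the usual bounded-variance case.
\mathbb{E}\left[\|g(\bm{x}) - \nabla f(\bm{x})\|^2\right]
  \le \sigma_0^2 + \sigma_1^2 \|\nabla f(\bm{x})\|^2

% Stationarity measure that a high-probability rate of this kind typically bounds.
\frac{1}{T}\sum_{s=1}^{T} \|\nabla f(\bm{x}_s)\|^2
  \le O\left(\frac{\text{poly}(\log T)}{\sqrt{T}}\right)
```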

Abstract

The Adaptive Moment Estimation (Adam) algorithm is highly effective in training various deep learning tasks. Despite this, theoretical understanding of Adam remains limited, especially for its vanilla form in non-convex smooth scenarios with potentially unbounded gradients and affine variance noise. In this paper, we study vanilla Adam under these challenging conditions. We introduce a comprehensive noise model that covers affine variance noise, bounded noise, and sub-Gaussian noise. We show that Adam can find a stationary point at a rate of $\mathcal{O}(\text{poly}(\log T)/\sqrt{T})$ with high probability under this general noise model, where $T$ denotes the total number of iterations, matching the lower bound for stochastic first-order algorithms up to logarithmic factors. More importantly, we reveal that Adam requires no tuning of step-sizes with respect to any problem parameters, yielding a better adaptation property than Stochastic Gradient Descent under the same conditions. We also provide a probabilistic convergence result for Adam under a generalized smooth condition that allows unbounded smoothness parameters and has been shown empirically to capture the smoothness of many practical objective functions more accurately.
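
To make the object of study concrete, here is a minimal NumPy sketch of a vanilla Adam update in the widely used bias-corrected form. It is an illustration under stated assumptions, not the paper's Algorithm: the function names `adam_step` and `run_noisy_adam`, the default hyper-parameters, the placement of `eps` outside the square root, and the toy noise whose scale grows with the gradient norm (mimicking affine-variance noise) are all choices made here.

```python
import numpy as np

def adam_step(x, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One bias-corrected Adam update (hypothetical helper; defaults are assumptions)."""
    m = beta1 * m + (1.0 - beta1) * grad        # first-moment (momentum) estimate
    v = beta2 * v + (1.0 - beta2) * grad ** 2   # second-moment estimate, coordinate-wise
    m_hat = m / (1.0 - beta1 ** t)              # bias corrections; t counts from 1
    v_hat = v / (1.0 - beta2 ** t)
    x = x - lr * m_hat / (np.sqrt(v_hat) + eps)  # adaptive, per-coordinate step
    return x, m, v

def run_noisy_adam(grad_fn, x0, steps, noise_scale=0.1, seed=0, **adam_kwargs):
    """Run Adam with synthetic gradient noise whose scale grows with ||grad||.

    The noise here is an illustrative stand-in for affine-variance noise,
    not the paper's noise model.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    for t in range(1, steps + 1):
        g = grad_fn(x)
        sigma = noise_scale * (1.0 + np.linalg.norm(g))
        noisy_g = g + sigma * rng.standard_normal(x.shape)
        x, m, v = adam_step(x, noisy_g, m, v, t, **adam_kwargs)
    return x

if __name__ == "__main__":
    # Toy objective f(x) = 0.5 * ||x||^2, whose gradient is x itself.
    x_final = run_noisy_adam(lambda x: x, [5.0, -3.0], steps=1000, lr=0.05)
    print("final gradient norm:", np.linalg.norm(x_final))
```

On the first step from zero-initialized moments, the update reduces to roughly `lr * sign(grad)` in each coordinate, which illustrates why the step length does not depend on the raw gradient scale.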

Paper Structure

This paper contains 62 sections, 35 theorems, 269 equations, 1 table, 2 algorithms.

Key Result

Theorem 3.1

Let $T \ge 1$ and $\{\bm{x}_s\}_{s \in [T]}$ be the sequence generated by Algorithm \ref{alg:Adam}. If Assumptions (A1)-(A3) hold, and the hyper-parameters satisfy [...] for some constants $c,C_0> 0$ and $\epsilon_0 > 0$, then for any given $\delta \in (0,1/2)$, it holds with probability at least $1-2\delta$ that [...], where $G^2$ is defined by the following order with respect to $T,\epsilon_0,\delta$: [...] The d

Theorems & Definitions (68)

  • Theorem 3.1
  • Theorem 3.2: informal version of \ref{thm:no_corrective}
  • Theorem 4.1
  • Theorem 4.2: informal version of \ref{thm:tgeneral_smooth}
  • Lemma A.1
  • proof
  • Lemma A.2
  • proof
  • Lemma A.3
  • Lemma B.1
  • ...and 58 more