Continuous-Time Analysis of Adaptive Optimization and Normalization

Rhys Gould, Hidenori Tanaka

TL;DR

This work advances the theoretical understanding of adaptive optimizers by formulating Adam/AdamW in continuous time, deriving a stable region for the hyperparameters $(\beta, \gamma)$ that ensures bounded updates, and linking scale-invariant architectural components to an implicit meta-adaptive normalization. It then introduces the explicit optimizer $2$-Adam and the general $k$-Adam family, which apply adaptive normalization $k$ times, motivated by scale-invariance dynamics. Empirical results on transformer- and CNN-based tasks show that operating in the stable region improves generalization and that $2$-Adam can outperform standard Adam in several settings. The study provides principled guidance for hyperparameter choices and architectural decisions, and may apply beyond the tested architectures and loss functions.

Abstract

Adaptive optimization algorithms, particularly Adam and its variant AdamW, are fundamental components of modern deep learning. However, their training dynamics lack comprehensive theoretical understanding, with limited insight into why common practices -- such as specific hyperparameter choices and normalization layers -- contribute to successful generalization. This work presents a continuous-time formulation of Adam and AdamW, facilitating a tractable analysis of training dynamics that can shed light on such practical questions. We theoretically derive a stable region for Adam's hyperparameters $(\beta, \gamma)$ that ensures bounded updates, empirically verifying these predictions by observing unstable exponential parameter growth outside of this stable region. Furthermore, we theoretically justify the success of normalization layers by uncovering an implicit meta-adaptive effect of scale-invariant architectural components. This insight leads to an explicit optimizer, $2$-Adam, which we generalize to $k$-Adam -- an optimizer that applies an adaptive normalization procedure $k$ times, encompassing Adam (corresponding to $k=1$) and Adam with a normalization layer (corresponding to $k=2$). Overall, our continuous-time formulation of Adam facilitates a principled analysis, offering deeper understanding of optimal hyperparameter choices and architectural decisions in modern deep learning.

Paper Structure

This paper contains 31 sections, 8 theorems, 75 equations, 16 figures, 1 algorithm.

Key Result

Lemma 1

Up to order $\eta^p$ and for $\beta, \gamma \in (0, 1)$, the continuous-time moving averages $m(t)$ and $v(t)$ satisfy the first-order differential equations with solutions where $g(t) := \nabla_{\theta} L(\theta(t))$ and defining

Figures (16)

  • Figure 1: Continuous-time model closely agrees with discrete-time trajectories. We plot the discrete-time and continuous-time trajectories for 16 randomly chosen parameters from a transformer model.
  • Figure 2: Visualization of level curves $\mathcal{B}_{c}$ (solid lines) and normal curves $\mathcal{C}_{\beta, \gamma}$ (dashed red lines). Level curves are coloured by their value of $C(\beta, \gamma)$ (i.e., purple is the most positive value and yellow the most negative). The bounded-update region $\mathcal{B}_{+}$ is highlighted in red.
  • Figure 3: Max-update bound accurately predicts stable region and unstable exponent of divergence. For a range of adaptive hyperparameter values $(\beta, \gamma)$, we plot (a) the max-update $||u_n||_{\infty} \equiv ||\theta_n-\theta_{n-1}||_{\infty} / \eta$ over training iterations $n$, and (b) the slope $d\log||u_n||_{\infty}/dn$ of the log-max-update at iteration $n=1000$ (to quantify exponential growth). In (a) we visualize the max-update bounds (eqn:maxupd) as dotted lines, and in (b) we denote the predicted slope/exponent $|C(\beta, \gamma)|/2$ (when $C(\beta, \gamma) < 0$) as a dashed line. We consider $64$ choices for $(\beta, \gamma)$, visualized in (c), taken uniformly along a section of the normal curve $\mathcal{C}_{\tilde{\beta}, \tilde{\gamma}}$ passing through the point $(\tilde{\beta}, \tilde{\gamma}) = (0.999, 0.9)$.
  • Figure 4: Our theory accurately predicts the divergence of test loss across Adam's hyperparameter space. For a range of values $(\beta, \gamma)$, we plot (a) the test loss over training iterations $n$, and (b) the best test loss achieved over the first 1000 and 3000 iterations. We consider $128$ choices for $(\beta, \gamma)$, visualized in (c), taken uniformly along the entire normal curve $\mathcal{C}_{\tilde{\beta}, \tilde{\gamma}}$. The rightmost point in (c) corresponds to $(\tilde{\beta}, \tilde{\gamma}) = (0.999, 0.9)$.
  • Figure 5: Norm approximation agrees with true norm. The true trajectory of $||W||^2$ compared to the squared-norm approximation (thm:finalsquarednorm, dashed line) for 16 randomly chosen query/key matrices.
  • ...and 11 more figures
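
The max-update diagnostic described in the Figure 3 caption can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation. It assumes the textbook Adam update with bias correction, and it assumes that $\gamma$ is the first-moment decay and $\beta$ the second-moment decay, which is inferred only from the reference point $(\tilde{\beta}, \tilde{\gamma}) = (0.999, 0.9)$. The function names `adam_max_updates` and `log_slope`, the toy quadratic loss, and the finite-difference window are all hypothetical choices made for this sketch.

```python
import numpy as np


def adam_max_updates(grad_fn, theta0, eta=1e-3, beta=0.999, gamma=0.9,
                     eps=1e-8, steps=200):
    """Run textbook Adam and record ||theta_n - theta_{n-1}||_inf / eta.

    Assumption: gamma decays the first moment m, beta decays the second
    moment v (inferred from the (0.999, 0.9) reference point, not stated
    explicitly in the text above).
    """
    theta = np.asarray(theta0, dtype=float).copy()
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    max_updates = []
    for n in range(1, steps + 1):
        g = grad_fn(theta)
        m = gamma * m + (1.0 - gamma) * g          # first-moment moving average
        v = beta * v + (1.0 - beta) * g ** 2       # second-moment moving average
        m_hat = m / (1.0 - gamma ** n)             # standard bias correction
        v_hat = v / (1.0 - beta ** n)
        new_theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
        # Max-update as defined in the Figure 3 caption.
        max_updates.append(np.max(np.abs(new_theta - theta)) / eta)
        theta = new_theta
    return theta, np.array(max_updates)


def log_slope(max_updates, n, window=10):
    """Finite-difference estimate of d log||u_n||_inf / dn near index n."""
    logs = np.log(max_updates)
    return (logs[n] - logs[n - window]) / window


if __name__ == "__main__":
    # Toy quadratic loss L(theta) = 0.5 * ||theta||^2, so the gradient is theta.
    theta, mu = adam_max_updates(lambda th: th, [1.0, -2.0, 0.5])
    print("first max-update:", mu[0])
    print("log-slope near the end:", log_slope(mu, len(mu) - 1))
```

A slope near zero indicates that updates are not growing, while a persistently positive slope would indicate exponential growth of the kind the caption describes outside the stable region. Sweeping `beta` and `gamma` over a grid and recording the slope would be one way to explore this on a toy problem, though it would not reproduce the paper's experiments.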

Theorems & Definitions (16)

  • Definition 1: Adam
  • Definition 2: Adaptive normalization
  • Lemma 1 (appendix label: app:adamw continuous)
  • Proposition 1
  • Proposition 2 (appendix label: app:adamw continuous)
  • Theorem 1 (appendix label: app:maxupdderivation)
  • Lemma 2 (appendix label: app:symmetries of transformer)
  • Theorem 2 (appendix label: app:direcdiffeq)
  • Proposition 3 (appendix label: app:direcdiffeq)
  • Definition 3: $k$-Adam
  • ...and 6 more