From Adam to Adam-Like Lagrangians: Second-Order Nonlocal Dynamics

Carlos Heredia

TL;DR

The paper addresses the lack of a principled dynamical understanding of Adam by formulating a second-order, nonlocal continuous-time model whose causal memory kernels capture the influence of past gradients. It shows that, as $\alpha\to 0$, this accelerated integro-differential equation (IDE) reduces to the established first-order nonlocal Adam flow on fixed horizons away from the initial time, with a quantified perturbation governed by $\rho=\max\{\alpha, (\alpha/(1-\beta_1))^2, \alpha/(1-\beta_2)\}$. A Lyapunov-based stability framework yields dissipation and convergence results under standard smoothness assumptions; Polyak-Łojasiewicz (PL) and Kurdyka-Łojasiewicz (KL) structures give exponential or rate-based decay up to $O(\rho)$-dependent neighborhoods. A nonlocal Lagrangian viewpoint provides an ideal reciprocity-guided variational blueprint for optimizer design. Numerical experiments on Rosenbrock-type landscapes validate the model: the second-order nonlocal dynamics closely tracks discrete Adam and offers improved accuracy in the small-step regime, while illustrating memory-related behavior such as basin transitions and moment positivity constraints.
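
To get a feel for the perturbation scale $\rho$ defined above, the following minimal sketch evaluates the three candidate terms for a few step sizes. The function name `perturbation_scale` is hypothetical and introduced only for illustration; the formula is the one quoted in the TL;DR, and the parameter values are those reported in the figure captions.

```python
def perturbation_scale(alpha: float, beta1: float, beta2: float) -> float:
    """rho = max{alpha, (alpha/(1-beta1))^2, alpha/(1-beta2)}, as quoted in the TL;DR.

    The function name is an illustrative assumption, not taken from the paper.
    """
    first_order = alpha
    momentum_term = (alpha / (1.0 - beta1)) ** 2
    second_moment_term = alpha / (1.0 - beta2)
    return max(first_order, momentum_term, second_moment_term)


if __name__ == "__main__":
    # (beta1, beta2) = (0.99, 0.999) are the values used in the figure captions.
    for alpha in (1e-3, 1e-4, 1e-5):
        rho = perturbation_scale(alpha, beta1=0.99, beta2=0.999)
        print(f"alpha={alpha:.0e}  rho={rho:.3g}")
```

For $(\beta_1,\beta_2)=(0.99,0.999)$ and $\alpha=10^{-3}$ the three candidates are $10^{-3}$, $10^{-2}$, and approximately $1$, so the $\alpha/(1-\beta_2)$ term dominates. With these moment parameters, $\rho$ becomes small only once $\alpha$ is well below $1-\beta_2$.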

Abstract

In this paper, we derive an accelerated continuous-time formulation of Adam by modeling it as a second-order integro-differential dynamical system. We relate this inertial nonlocal model to an existing first-order nonlocal Adam flow through an $\alpha$-refinement limit, and we provide Lyapunov-based stability and convergence analyses. We also introduce an Adam-inspired nonlocal Lagrangian formulation, offering a variational viewpoint. Numerical simulations on Rosenbrock-type examples show agreement between the proposed dynamics and discrete Adam.

Paper Structure

This paper contains 31 sections, 8 theorems, 238 equations, 10 figures, 2 algorithms.

Key Result

Proposition 1

Let $\alpha>0$ be small and consider the continuous limit $t\approx k\,\alpha$ with a second-order time expansion [displayed equation missing from the extracted text]. With the initial data $m^i(0)=\dot m^i(0)=0$ and $v(0)=\dot v(0)=0$, the Adam update admits an inertial ODE with linear friction [displayed equation missing], together with continuous bias-correction factors [displayed equation missing]. The continuous moments are causal convolutions with kernels $K_\beta(s)$, where $s:=t-\tau\ge 0$ [kernel expression truncated in the extracted text].
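
The statement that the moments are causal convolutions has a familiar discrete counterpart, sketched below. This is not the paper's continuous kernel $K_\beta$, which is not reproduced in this excerpt. It only illustrates, for the standard discrete Adam moment recursion, that an exponential moving average equals a one-sided sum over past gradients with geometric weights. The function names are hypothetical.

```python
import numpy as np


def ema_recursive(g: np.ndarray, beta: float) -> np.ndarray:
    """Standard Adam-style recursion m_k = beta*m_{k-1} + (1-beta)*g_k, started from zero."""
    m = np.zeros_like(g, dtype=float)
    prev = 0.0
    for k, gk in enumerate(g):
        prev = beta * prev + (1.0 - beta) * gk
        m[k] = prev
    return m


def ema_convolution(g: np.ndarray, beta: float) -> np.ndarray:
    """Same quantity as a causal sum: m_k = (1-beta) * sum_{j<=k} beta^(k-j) * g_j."""
    n = len(g)
    kernel = (1.0 - beta) * beta ** np.arange(n)  # geometric (discrete exponential) weights
    return np.convolve(g, kernel)[:n]             # keep only the causal part


def bias_corrected(m: np.ndarray, beta: float) -> np.ndarray:
    """Divide by 1 - beta^(k+1) (0-based k), the usual Adam bias correction."""
    k = np.arange(1, len(m) + 1)
    return m / (1.0 - beta ** k)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    g = rng.normal(size=50)  # synthetic gradients, for illustration only
    print(np.allclose(ema_recursive(g, 0.9), ema_convolution(g, 0.9)))
```

The zero initial moment in the recursion mirrors the initial data $m^i(0)=0$ and $v(0)=0$ in the proposition, and the bias correction compensates for that start: for a constant gradient the corrected moment equals the gradient at every step.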

Figures (10)

  • Figure 1: Geometric meaning of the translation operator. For a trajectory $\theta:\mathbb{R}\to\mathbb{R}^n$, the map $t\mapsto T_t\theta$ defines an orbit in the space of trajectories, and time-nonlocal models may depend on the full temporal profile $\tau\mapsto \theta(t+\tau)$ rather than only on the local value $\theta(t)$.
  • Figure 2: Rosenbrock-type loss for several values of $c$.
  • Figure 3: Final-time errors between discrete Adam $\theta_K$ (with $K=\lfloor T/\alpha\rfloor$) and the continuous-time model sampled at $t_k=k\alpha$: $E_T(\alpha)=|\theta_K-\theta(T)|$ (left) and $E_{f,T}(\alpha)=|f(\theta_K)-f(\theta(T))|$ (right) for $\beta_1=0.99$ and $\beta_2=0.999$. The legend reports an effective log-log slope $p$ from a fit $E(\alpha)\approx C\alpha^p$. Across step sizes, the inertial second-order model is consistently closer to the discrete iterates, in line with the theory; near the minimizer, $E_{f,T}=\mathcal{O}(E_T^2)$ explains the larger slopes in the right panel.
  • Figure 4: Complete optimization dynamics for $c=4$ with $\alpha=10^{-3}$ and $(\beta_1,\beta_2)=(0.99,0.999)$, initialized at $\theta(0)=2$ and $\dot\theta(0)=u_0=1$. Top row: position $\theta$, velocity $\dot\theta$, and acceleration $\ddot\theta$, illustrating an early inertial transient followed by dissipative relaxation. Middle row: first and second moments $(m,v)$ and the effective adaptive update scale $|m|/(\sqrt{v}+\varepsilon)$. Bottom row: phase portrait $(\theta,\dot\theta)$ (colored by iteration) and kinetic energy $\tfrac{1}{2}|\dot\theta|^2$. The trajectory converges toward the minimizer at $\theta=1$, consistent with the expected basin of attraction for this initialization.
  • Figure 5: Basin-selection transition in the bistable case $c=4$ as the stepsize $\alpha$ varies. We fix $(\beta_1,\beta_2)=(0.99,0.999)$ and initialize at $\theta_0=-1.5$; for the inertial second-order model we set $u_0=1$. The plot reports the final value of $\theta$ as a function of $\alpha$ for discrete Adam and for the second-order continuous-time model. Both exhibit the same threshold-like basin switch: for relatively large $\alpha$ they converge to the global minimizer $\theta=1$, whereas for smaller $\alpha$ the dynamics remains trapped near the local minimizer (here $\theta \approx -0.853$).
  • ...and 5 more figures
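
The effective log-log slope $p$ mentioned in the Figure 3 caption can be estimated with an ordinary least-squares fit of $\log E$ against $\log\alpha$. The sketch below assumes this is how such a slope would typically be computed; the paper's actual fitting procedure is not given in this excerpt, the helper name `loglog_slope` is hypothetical, and the error values are synthetic rather than taken from the paper.

```python
import numpy as np


def loglog_slope(alphas, errors):
    """Fit log E = log C + p * log(alpha) by least squares and return (p, C)."""
    log_alpha = np.log(np.asarray(alphas, dtype=float))
    log_error = np.log(np.asarray(errors, dtype=float))
    p, log_c = np.polyfit(log_alpha, log_error, 1)  # coefficients: slope first, then intercept
    return float(p), float(np.exp(log_c))


if __name__ == "__main__":
    alphas = np.array([1e-2, 5e-3, 2e-3, 1e-3])
    errors = 3.0 * alphas ** 2  # synthetic data with known slope p = 2 and constant C = 3
    p, c = loglog_slope(alphas, errors)
    print(f"p = {p:.3f}, C = {c:.3f}")

    # Squaring an error that scales like alpha^p gives alpha^(2p), which is the
    # mechanism the caption invokes for the larger slopes in the right panel.
    p_sq, _ = loglog_slope(alphas, errors ** 2)
    print(f"slope of squared error = {p_sq:.3f}")
```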

Theorems & Definitions (18)

  • Proposition 1
  • proof
  • Lemma 1
  • proof
  • Proposition 2
  • proof
  • Theorem 1
  • proof
  • Corollary 1
  • proof
  • ...and 8 more