ODE approximation for the Adam algorithm: General and overparametrized setting
Steffen Dereich, Arnulf Jentzen, Sebastian Kassing
TL;DR
This work studies the Adam optimizer through an ODE-based lens in a fast-slow scaling regime with fixed momentum parameters and vanishing step-sizes. It shows that the Adam iterates track the semiflow of a dedicated Adam vector field $f$, so that they form an asymptotic pseudo-trajectory of the flow $\dot{\Psi}_t=f(\Psi_t)$. Consequently, if the iterates converge, their limit must be a zero of $f$, which need not be a critical point of the objective. In overparametrized empirical risk minimization (ERM), by contrast, the objective serves as a Lyapunov function for this flow near the global minima, so the iterates converge to the set of global minima whenever they enter such a neighborhood infinitely often. The authors also address degenerate noise in ERM, proving convergence toward the set of minimizers under moment and growth conditions, and establish convergence to global minima for minibatch schemes under a Polyak–Łojasiewicz condition, supported by a combinatorial bound on small-gradient events. Overall, the paper provides a rigorous and general ODE-based framework for Adam's long-run dynamics in stochastic and overparametrized settings, clarifying how noise, time-scaling, and minibatching affect convergence.
Abstract
The Adam optimizer is presumably the most popular optimization method in deep learning at present. In this article we develop an ODE-based method to study the Adam optimizer in a fast-slow scaling regime. For fixed momentum parameters and vanishing step-sizes, we show that the Adam algorithm is an asymptotic pseudo-trajectory of the flow of a particular vector field, which we refer to as the Adam vector field. Leveraging properties of asymptotic pseudo-trajectories, we establish convergence results for the Adam algorithm. In particular, in a very general setting we show that if the Adam algorithm converges, then the limit must be a zero of the Adam vector field, rather than a local minimizer or critical point of the objective function. In contrast, in the overparametrized empirical risk minimization setting, the Adam algorithm is able to locally find the set of minima. Specifically, we show that in a neighborhood of the global minima, the objective function serves as a Lyapunov function for the flow induced by the Adam vector field. As a consequence, if the Adam algorithm enters a neighborhood of the global minima infinitely often, it converges to the set of global minima.
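To make the regime of fixed momentum parameters and vanishing step-sizes concrete, the following is a minimal Python sketch. It uses the commonly stated Adam recursion with bias correction, which may differ in detail from the formulation analyzed in the paper. The function name `adam_vanishing_steps`, the step-size schedule $\gamma_n = \gamma_0 / n^{\text{decay}}$, the default parameter values, and the deterministic one-dimensional test gradient are all assumptions made for illustration; the paper itself treats stochastic gradients.

```python
import math


def adam_vanishing_steps(grad, theta0, n_steps=20000, beta1=0.9, beta2=0.999,
                         eps=1e-8, gamma0=0.1, decay=0.5):
    """Run scalar Adam with fixed beta1, beta2 and step-sizes gamma0 / n**decay.

    `grad` is a hypothetical callable returning the gradient at a point.
    Returns the list of iterates, starting with theta0.
    """
    theta = float(theta0)
    m = 0.0  # exponential moving average of the gradients
    v = 0.0  # exponential moving average of the squared gradients
    path = [theta]
    for n in range(1, n_steps + 1):
        g = grad(theta)
        m = beta1 * m + (1.0 - beta1) * g
        v = beta2 * v + (1.0 - beta2) * g * g
        m_hat = m / (1.0 - beta1 ** n)  # bias-corrected first moment
        v_hat = v / (1.0 - beta2 ** n)  # bias-corrected second moment
        gamma = gamma0 / n ** decay     # vanishing step-size (assumed schedule)
        theta -= gamma * m_hat / (math.sqrt(v_hat) + eps)
        path.append(theta)
    return path


# Illustrative objective 0.5 * (theta - 1)**2, whose gradient is theta - 1.
path = adam_vanishing_steps(lambda t: t - 1.0, theta0=3.0)
print(path[0], path[-1])
```

In this noise-free toy problem the zero of the gradient and the limit of the iterates coincide. The abstract's point is that in the general stochastic setting the limit is characterized by the Adam vector field instead, which this sketch does not attempt to compute.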
