LaProp: Separating Momentum and Adaptivity in Adam

Liu Ziyin, Zhikang T. Wang, Masahito Ueda

TL;DR

LaProp identifies a coupling between momentum and adaptivity in Adam-style optimizers that can cause instability and divergence. By decoupling momentum (K_t = I) from adaptivity, LaProp yields a stable, flexible optimizer that interpolates between adaptive gradient methods and signed gradient methods. The theoretical analysis bounds the magnitude of the update and gives a regret bound for convex problems, while experiments on synthetic tasks, style transfer, deep networks, transformers, and Atari reinforcement learning show consistent speed and stability advantages over Adam. The decoupling also widens the range of usable hyperparameters and improves robustness in noisy training regimes.

Abstract

We identify a so-far-unrecognized problem of Adam-style optimizers that results from an unnecessary coupling between momentum and adaptivity. The coupling leads to instability and divergence when the momentum and adaptivity parameters are mismatched. In this work, we propose a method, LaProp, which decouples momentum and adaptivity in Adam-style methods. We show that the decoupling leads to greater flexibility in the hyperparameters and allows for a straightforward interpolation between signed gradient methods and adaptive gradient methods. We experimentally show that LaProp consistently improves speed and stability over Adam on a variety of tasks. We also bound the regret of LaProp on a convex problem and show that our bound differs from that of Adam by a key factor, which demonstrates its advantage.
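
To make the decoupling concrete, the two update rules can be written side by side. This is a sketch, not a quotation from the paper: $\mu$, $\nu$, $c_m$ and $c_n$ follow the notation of Proposition 1, while the gradient $g_t$, parameters $\theta_t$, learning rate $\lambda$, stabilizer $\epsilon$, and the exact placement of $\epsilon$ are assumptions made here for illustration.

```latex
% Shared second-moment estimate and bias corrections
n_t = \nu\, n_{t-1} + (1-\nu)\, g_t^2, \qquad c_n = 1-\nu^t, \qquad c_m = 1-\mu^t

% Adam: momentum accumulates raw gradients; normalization is applied afterwards
m_t = \mu\, m_{t-1} + (1-\mu)\, g_t, \qquad
\theta_{t+1} = \theta_t - \lambda\, \frac{m_t / c_m}{\sqrt{n_t / c_n} + \epsilon}

% LaProp: each gradient is normalized first; momentum accumulates normalized gradients
m_t = \mu\, m_{t-1} + (1-\mu)\, \frac{g_t}{\sqrt{n_t / c_n} + \epsilon}, \qquad
\theta_{t+1} = \theta_t - \lambda\, \frac{m_t}{c_m}

% Limit \nu = 0 (taking \epsilon \to 0): n_t = g_t^2 and c_n = 1, so
\frac{g_t}{\sqrt{n_t / c_n}} = \operatorname{sign}(g_t)
```

Under these assumptions, setting $\nu = 0$ turns LaProp into a signed gradient method with momentum, which is the interpolation the abstract refers to. In Adam the same limit divides an average of past gradients by the current $|g_t|$, so the ratio is not controlled when $\mu$ and $\nu$ are mismatched.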

Paper Structure

This paper contains 34 sections, 3 theorems, 26 equations, 16 figures, 1 table, 4 algorithms.

Key Result

Proposition 1

Bound for LaProp update. Let $m_t$ be defined as in the LaProp algorithm, and set $c_n = 1 - \nu^t$, $c_m = 1 - \mu^t$. Then the magnitude of the update is bounded from above as $\left|\frac{m_t}{c_m}\right| \leq \frac{1}{\sqrt{1-\nu}}$.
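
The bound in Proposition 1 can be checked numerically with a minimal scalar sketch. The update rules below are assumptions about the algorithm (LaProp normalizes each gradient by $\sqrt{n_t/c_n}$ before accumulating momentum; the Adam variant is the standard bias-corrected form), and the function names, the `eps` placement, and the test gradient sequences are hypothetical choices made for illustration.

```python
import numpy as np

def laprop_updates(grads, mu=0.9, nu=0.7, eps=1e-15):
    """Bias-corrected LaProp updates m_t / c_m for a scalar gradient sequence (assumed form)."""
    m, n, out = 0.0, 0.0, []
    for t, g in enumerate(grads, start=1):
        n = nu * n + (1.0 - nu) * g * g           # second-moment estimate
        c_n = 1.0 - nu ** t                       # bias correction for n
        c_m = 1.0 - mu ** t                       # bias correction for m
        # Normalize the gradient first, then accumulate momentum.
        m = mu * m + (1.0 - mu) * g / (np.sqrt(n / c_n) + eps)
        out.append(m / c_m)
    return np.array(out)

def adam_updates(grads, mu=0.9, nu=0.7, eps=1e-15):
    """Bias-corrected Adam updates for the same sequence (standard form, shown for contrast)."""
    m, n, out = 0.0, 0.0, []
    for t, g in enumerate(grads, start=1):
        m = mu * m + (1.0 - mu) * g               # momentum over raw gradients
        n = nu * n + (1.0 - nu) * g * g           # second-moment estimate
        c_n = 1.0 - nu ** t
        c_m = 1.0 - mu ** t
        # Normalize only after the momentum has been accumulated.
        out.append((m / c_m) / (np.sqrt(n / c_n) + eps))
    return np.array(out)

mu, nu = 0.9, 0.7
bound = 1.0 / np.sqrt(1.0 - nu)

# Hypothetical test sequences: Gaussian noise, and one large gradient followed by tiny ones.
rng = np.random.default_rng(0)
noise = rng.normal(size=500)
spike = [1.0] + [1e-8] * 60

print("bound from Proposition 1:", bound)
print("LaProp max |update| (noise):", np.abs(laprop_updates(noise, mu, nu)).max())
print("LaProp max |update| (spike):", np.abs(laprop_updates(spike, mu, nu)).max())
print("Adam   max |update| (spike):", np.abs(adam_updates(spike, mu, nu)).max())
```

In this sketch the bound holds for LaProp because each normalized gradient satisfies $|g_t| / \sqrt{n_t / c_n} \leq 1/\sqrt{1-\nu}$ (since $n_t \geq (1-\nu) g_t^2$ and $c_n \leq 1$), and $m_t / c_m$ is a weighted average of those terms. On the spike sequence with $\mu = 0.9$, $\nu = 0.7$, the Adam update exceeds the same threshold, since the momentum decays more slowly than the square root of the second-moment estimate.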

Figures (16)

  • Figure 1: Divergence of Adam on a two-layer ReLU network trained on MNIST with $\mu=0.9,\ \nu=0.7$. In contrast, LaProp is always stable.
  • Figure 2: Time taken to converge on the noisy Rosenbrock task plotted against $\nu$, with $\sigma$ being the noise level. (a) When the noise is small, the optimization speed of LaProp is almost invariant w.r.t. $\nu$, demonstrating its flexibility in hyperparameters compared with the other optimizers; (b) when the noise gets larger, the performance of Adam and AMSGrad degrades, and they cannot work in the small $\nu$ regime where LaProp has its best performance; (c, d) for $\sigma \geq 0.12$, only LaProp converges even if we lengthen the optimization to $10000$ steps. Data points are plotted at equal intervals for all the curves, and we see that LaProp is much more stable. Results for different learning rates for $\sigma=0.10$ are shown in the appendix.
  • Figure 3: Neural style transfer with different optimizers. (a) The average regret $R(T)/T$ at $T=1000$ plotted against $\nu$. A lower value corresponds to a better convergence rate (Kingma & Ba, 2014; Reddi et al., 2018). (b) Example optimization curves of different optimizers for the first 120 updates.
  • Figure 4: Training curves of deep FC networks.
  • Figure 5: Learning curves of the transformer tasks. When there is a warmup, the learning rate linearly increases from zero to the maximum and then decreases; otherwise it starts from the maximum and decreases. The warmup includes the first $2\times 10^3$ updates in (a), and $10\times 10^3$ updates in (b).
  • ...and 11 more figures

Theorems & Definitions (3)

  • Proposition 1
  • Proposition 2
  • Theorem 1