LaProp: Separating Momentum and Adaptivity in Adam
Liu Ziyin, Zhikang T. Wang, Masahito Ueda
TL;DR
LaProp identifies a fundamental coupling between momentum and adaptivity in Adam-inspired optimizers that can cause instability. By decoupling momentum (K_t = I) from adaptivity, LaProp offers a stable, flexible optimizer that interpolates between adaptive methods and signed gradient methods. Theoretical analysis provides bounded updates and a convex regret bound, while extensive experiments across synthetic tasks, style transfer, deep nets, transformers, and Atari reinforcement learning show consistent speed and stability advantages over Adam. This decoupling enables easier hyperparameter tuning and robustness in noisy or complex training regimes, with practical benefits for large-scale neural network optimization.
Abstract
We identify a previously unrecognized problem of Adam-style optimizers that results from an unnecessary coupling between momentum and adaptivity. The coupling leads to instability and divergence when the momentum and adaptivity parameters are mismatched. In this work, we propose a method, LaProp, which decouples momentum and adaptivity in Adam-style methods. We show that the decoupling leads to greater flexibility in the hyperparameters and allows for a straightforward interpolation between signed gradient methods and adaptive gradient methods. We experimentally show that LaProp consistently improves speed and stability over Adam on a variety of tasks. We also bound the regret of LaProp on a convex problem and show that our bound differs from that of Adam by a key factor, which demonstrates its advantage.
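To make the coupling concrete, the following is a minimal scalar sketch, not the authors' reference implementation. It assumes the usual form of the two update rules: Adam accumulates momentum on the raw gradient and divides by the second-moment estimate afterwards, whereas LaProp is taken here to divide the gradient first and accumulate momentum on the normalized gradient. The function names, the `state` dictionary, the helper `max_update`, and the default values (including `eps`) are illustrative assumptions and do not come from the text above.

```python
import math


def adam_step(theta, g, state, lr=1e-3, mu=0.9, nu=0.999, eps=1e-15):
    """One scalar Adam-style step: momentum on the raw gradient, then normalize."""
    t = state["t"] + 1
    m = mu * state["m"] + (1 - mu) * g        # momentum of raw gradients
    n = nu * state["n"] + (1 - nu) * g * g    # second-moment estimate
    m_hat = m / (1 - mu ** t)                 # bias corrections
    n_hat = n / (1 - nu ** t)
    update = m_hat / (math.sqrt(n_hat) + eps)
    return theta - lr * update, {"t": t, "m": m, "n": n}, update


def laprop_step(theta, g, state, lr=1e-3, mu=0.9, nu=0.999, eps=1e-15):
    """One scalar decoupled step (assumed LaProp form): normalize, then momentum."""
    t = state["t"] + 1
    n = nu * state["n"] + (1 - nu) * g * g    # second-moment estimate
    n_hat = n / (1 - nu ** t)
    # Momentum is accumulated on the already-normalized gradient.
    m = mu * state["m"] + (1 - mu) * g / (math.sqrt(n_hat) + eps)
    update = m / (1 - mu ** t)
    return theta - lr * update, {"t": t, "m": m, "n": n}, update


def max_update(step_fn, grads, **hyper):
    """Largest absolute (pre-learning-rate) update seen over a gradient sequence."""
    theta, state, largest = 0.0, {"t": 0, "m": 0.0, "n": 0.0}, 0.0
    for g in grads:
        theta, state, update = step_fn(theta, g, state, **hyper)
        largest = max(largest, abs(update))
    return largest


# Deliberately mismatched setting: long momentum memory (mu = 0.9) with
# no memory in the second moment (nu = 0), and a sudden drop in gradient scale.
grads = [10.0, 10.0, 10.0, 1e-3]
adam_max = max_update(adam_step, grads, mu=0.9, nu=0.0)
laprop_max = max_update(laprop_step, grads, mu=0.9, nu=0.0)
print(f"Adam max |update|:   {adam_max:.3g}")
print(f"LaProp max |update|: {laprop_max:.3g}")
```

In this toy setting, with nu = 0 the normalized gradient reduces to the sign of the gradient, so the decoupled update is a running average of signs and stays within magnitude 1. The Adam-style update instead divides a momentum term built from the earlier large gradients by the latest small gradient, and grows by several orders of magnitude. This illustrates both the mismatch instability and the signed-gradient limit described in the abstract, under the assumptions stated above.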
