AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights

Byeongho Heo, Sanghyuk Chun, Seong Joon Oh, Dongyoon Han, Sangdoo Yun, Gyuwan Kim, Youngjung Uh, Jung-Woo Ha

TL;DR

Normalization-induced scale invariance can interact with momentum-based optimizers to rapidly erode effective learning rates on the normalized weight space, potentially hampering performance. The authors propose SGDP and AdamP, projection-based updates that remove the radial component while preserving update directions, thereby slowing norm growth without altering directions in the effective space. The methods demonstrate consistent improvements across a wide range of tasks, including vision, language, audio, and retrieval, with robustness to weight decay and competitive computational cost. This work offers a practical fix for a ubiquitous interaction between normalization, scale invariance, and momentum in modern deep networks.

Abstract

Normalization techniques are a boon for modern deep learning. They let weights converge more quickly, often with better generalization performance. It has been argued that the normalization-induced scale invariance among the weights provides an advantageous ground for gradient descent (GD) optimizers: the effective step sizes are automatically reduced over time, stabilizing the overall training procedure. It is often overlooked, however, that the additional introduction of momentum in GD optimizers results in a far more rapid reduction in effective step sizes for scale-invariant weights, a phenomenon that has not yet been studied and may have caused unwanted side effects in current practice. This is a crucial issue because arguably the vast majority of modern deep neural networks consist of (1) momentum-based GD (e.g. SGD or Adam) and (2) scale-invariant parameters. In this paper, we verify that the widely adopted combination of the two ingredients leads to the premature decay of effective step sizes and sub-optimal model performance. We propose a simple and effective remedy, SGDP and AdamP: get rid of the radial component, or the norm-increasing direction, at each optimizer step. Because of the scale invariance, this modification only alters the effective step sizes without changing the effective update directions, thus enjoying the original convergence properties of GD optimizers. Given the ubiquity of momentum GD and scale invariance in machine learning, we have evaluated our methods against the baselines on 13 benchmarks. They range from vision tasks like classification (e.g. ImageNet), retrieval (e.g. CUB and SOP), and detection (e.g. COCO) to language modelling (e.g. WikiText) and audio classification (e.g. DCASE) tasks. We verify that our solution brings about uniform gains in those benchmarks. Source code is available at https://github.com/clovaai/AdamP.
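
To make the remedy described in the abstract concrete, here is a minimal NumPy sketch of removing the radial (norm-increasing) component from a momentum update. It is an illustration under stated assumptions, not the authors' implementation: the function names, the default `lr` and `beta` values, and the choice to project unconditionally (without first checking whether a weight is scale-invariant) are all hypothetical simplifications. The reference code is in the repository linked above.

```python
import numpy as np

def project_out_radial(w, update):
    """Return `update` with its component along `w` removed.

    Hypothetical helper: computes update - (w_hat . update) * w_hat,
    where w_hat = w / ||w||_2.
    """
    w_hat = w / np.linalg.norm(w)
    return update - np.dot(w_hat, update) * w_hat

def sgdp_like_step(w, grad, momentum_buf, lr=0.1, beta=0.9):
    """One momentum step whose update is projected onto the tangent space at w.

    Assumption: the momentum buffer accumulates raw gradients and only the
    applied update is projected. Defaults are illustrative, not tuned values.
    """
    momentum_buf = beta * momentum_buf + grad
    step = project_out_radial(w, momentum_buf)
    return w - lr * step, momentum_buf
```

Because the projected step is orthogonal to `w`, the squared norm after the step is the old squared norm plus `lr**2` times the squared length of the projected step, so no additional radial growth is contributed by the accumulated momentum.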

Paper Structure

This paper contains 50 sections, 7 theorems, 26 equations, 14 figures, 20 tables, 2 algorithms.

Key Result

Lemma 2.1

For a scale-invariant parameter $\boldsymbol{w}$ and the vanilla GD, where ${\boldsymbol{p}}_t = \nabla_{\boldsymbol{w}} f({\boldsymbol{w}}_{t})$,
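
The lemma statement is cut off in this listing. As a hedged sketch of the standard norm-growth argument (not a quotation of the paper's text), assume a learning rate $\eta$ and the update $\boldsymbol{w}_{t+1} = \boldsymbol{w}_t - \eta\,\boldsymbol{p}_t$; both the symbol $\eta$ and this form of the update are assumptions not shown above. Scale invariance makes the gradient orthogonal to the weight, so the cross term vanishes:

$$
\begin{aligned}
f(c\,\boldsymbol{w}) = f(\boldsymbol{w}) \;\; \forall c > 0
  \;&\Longrightarrow\; \boldsymbol{w}^\top \nabla_{\boldsymbol{w}} f(\boldsymbol{w}) = 0, \\
\|\boldsymbol{w}_{t+1}\|_2^2
  &= \|\boldsymbol{w}_t - \eta\,\boldsymbol{p}_t\|_2^2 \\
  &= \|\boldsymbol{w}_t\|_2^2 - 2\eta\,\boldsymbol{w}_t^\top \boldsymbol{p}_t + \eta^2 \|\boldsymbol{p}_t\|_2^2 \\
  &= \|\boldsymbol{w}_t\|_2^2 + \eta^2 \|\boldsymbol{p}_t\|_2^2 .
\end{aligned}
$$

Under these assumptions the weight norm never decreases under vanilla GD, which is the baseline against which the momentum case (Lemma 2.2) is compared.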

Figures (14)

  • Figure 1: Optimizer trajectories. Shown is $\bm{w}_t$ for the optimization problem $\max_{\bm{w}}\frac{\bm{w}^\top\bm{w}^\star}{\| \bm{w} \|_2\|\bm{w}^\star\|_2}$. Trajectories start from $\bm{w}_0$ and move towards the optimal solution $\bm{w}^\star$. The problem is invariant to the scale of $\bm{w}$. Video version in the attached code.
  • Figure 2: Vector directions of the gradient, momentum, and ours.
  • Figure 3: 3D scale-invariant Rosenbrock. Three optimization algorithms are compared. Upper row: loss surface and optimization steps. Lower row: norm $r$ of parameters over the iterations. Results for Adam variants are in the Appendix.
  • Figure 4: Adversarial training. Learning curves by Adam and AdamP.
  • Figure :
  • ...and 9 more figures
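
The scale-invariant toy problem in the Figure 1 caption can be approximated in a few lines to observe how the weight norm evolves. The sketch below is not the authors' experiment code: the function names, the starting point, the target, and the values of `lr`, `beta`, and `steps` are hypothetical, and the problem is written as minimizing the negative cosine similarity. Setting `beta=0.0` gives vanilla GD, and `project=True` removes the radial component of each momentum update.

```python
import numpy as np

def neg_cosine_grad(w, w_star):
    """Gradient of -cos(w, w_star) with respect to w."""
    norm_w = np.linalg.norm(w)
    w_hat = w / norm_w
    s_hat = w_star / np.linalg.norm(w_star)
    cos = np.dot(w_hat, s_hat)
    # The gradient is orthogonal to w because the loss is scale-invariant.
    return -(s_hat - cos * w_hat) / norm_w

def run(w0, w_star, lr=0.1, beta=0.9, steps=100, project=False):
    """Run (optionally projected) momentum GD and record ||w||_2 per step."""
    w = w0.astype(float)  # astype returns a copy, so w0 is left untouched
    buf = np.zeros_like(w)
    norms = [np.linalg.norm(w)]
    for _ in range(steps):
        buf = beta * buf + neg_cosine_grad(w, w_star)
        step = buf
        if project:
            w_hat = w / np.linalg.norm(w)
            step = buf - np.dot(w_hat, buf) * w_hat
        w = w - lr * step
        norms.append(np.linalg.norm(w))
    return w, np.array(norms)

if __name__ == "__main__":
    w0 = np.array([1.0, 0.2])      # illustrative starting point
    w_star = np.array([0.0, 1.0])  # illustrative target direction
    for name, kwargs in [("GD", {"beta": 0.0}),
                         ("momentum", {}),
                         ("projected momentum", {"project": True})]:
        _, norms = run(w0, w_star, **kwargs)
        print(f"{name}: final norm = {norms[-1]:.4f}")
```

The first step is identical with and without projection, since the momentum buffer then equals a gradient that is already orthogonal to `w`. From the second step on, the unprojected update carries a radial component and adds extra norm growth.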

Theorems & Definitions (10)

  • Lemma 2.1: Norm growth by GD (Lemma 2.4 in Arora et al., 2018)
  • Lemma 2.2: Norm growth by momentum
  • Corollary 2.3: Asymptotic norm growth comparison
  • Proposition 3.1: Effective update direction after projection
  • Lemma A.1: Monotonic norm growth by the momentum
  • proof
  • Corollary A.2: Asymptotic norm growth comparison
  • proof
  • Proposition A.3: Effective update direction after projection
  • proof