The Marginal Value of Momentum for Small Learning Rate SGD

Runzhe Wang, Sadhika Malladi, Tianhao Wang, Kaifeng Lyu, Zhiyuan Li

TL;DR

The work analyzes momentum in stochastic gradient methods under small learning-rate conditions, showing that SGDM behaves similarly to SGD when gradient noise dominates. By coupling SGDM trajectories and using weak-approximation and slow-SDE analyses, the authors show that, over $O(1/\eta)$ and $O(1/\eta^2)$ horizons, momentum does not confer meaningful acceleration or generalization advantages in typical training regimes. Empirical results on ImageNet, CIFAR-10, and language-model fine-tuning corroborate the theory, indicating momentum’s benefits are limited except in regimes with very large learning rates or batch sizes where different noise scales apply. These findings have practical implications for reducing hyperparameter search and for memory-saving training, since momentum buffers add substantial memory cost without reliably improving performance in common stochastic-noise-dominated settings.

Abstract

Momentum is known to accelerate the convergence of gradient descent in strongly convex settings without stochastic gradient noise. In stochastic optimization, such as training neural networks, folklore suggests that momentum may help deep learning optimization by reducing the variance of the stochastic gradient update, but previous theoretical analyses do not find momentum to offer any provable acceleration. Theoretical results in this paper clarify the role of momentum in stochastic settings where the learning rate is small and gradient noise is the dominant source of instability, suggesting that SGD with and without momentum behave similarly in the short and long time horizons. Experiments show that momentum indeed has limited benefits for both optimization and generalization in practical training regimes where the optimal learning rate is not very large, including small- to medium-batch training from scratch on ImageNet and fine-tuning language models on downstream tasks.
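
The claim that SGD and SGDM behave similarly at small learning rates can be illustrated on a toy problem. The sketch below is not taken from the paper: it assumes a one-dimensional quadratic loss $L(x) = x^2/2$ with additive Gaussian gradient noise, and an SGDM update normalized by $(1-\beta)$ so that both optimizers share the same learning rate. The function names and constants are hypothetical.

```python
import random


def noisy_grad(x: float, sigma: float, rng: random.Random) -> float:
    """Stochastic gradient of L(x) = x^2 / 2 with additive Gaussian noise."""
    return x + sigma * rng.gauss(0.0, 1.0)


def stationary_second_moment(lr, beta, sigma=1.0, steps=200_000,
                             burn_in=20_000, seed=0):
    """Run SGDM (beta = 0 recovers plain SGD) and estimate E[x^2] after burn-in.

    Assumed update (momentum normalized by 1 - beta, an illustrative choice):
        m <- beta * m + (1 - beta) * g
        x <- x - lr * m
    """
    rng = random.Random(seed)
    x, m = 1.0, 0.0
    total, count = 0.0, 0
    for k in range(steps):
        g = noisy_grad(x, sigma, rng)
        m = beta * m + (1.0 - beta) * g
        x -= lr * m
        if k >= burn_in:
            # Accumulate x^2 only once the iterate has forgotten its start.
            total += x * x
            count += 1
    return total / count


if __name__ == "__main__":
    lr = 0.01
    sgd = stationary_second_moment(lr, beta=0.0, seed=0)
    sgdm = stationary_second_moment(lr, beta=0.9, seed=1)
    print(f"SGD  E[x^2] ~ {sgd:.5f}")
    print(f"SGDM E[x^2] ~ {sgdm:.5f}")
```

For this toy model the stationary second moment of plain SGD is $\eta\sigma^2/(2-\eta)$, and with a small learning rate the SGDM estimate should land close to the same value. This is only a sanity check on a noise-dominated quadratic, not a reproduction of the paper's experiments.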

Paper Structure

This paper contains 38 sections, 42 theorems, 180 equations, 3 figures, 2 tables.

Key Result

Proposition 2.4

Given ${\bm{z}}_k$, the expected change of loss in the next step is

Figures (3)

  • Figure 1: SGDM performs comparably to SGD in training ResNet-50 on ImageNet with smaller batch sizes (e.g., 1024), and outperforms SGD significantly at larger batch sizes.
  • Figure 2: Standard SGDM achieves higher test performance than SGD (see $\ell=1$), but the two trajectories get closer when reducing the curvature-induced term with SVAG (i.e., increasing the value of $\ell$, see Definition 5.1 and the descent lemma, Proposition 2.4). These experiments confirm our theoretical findings that SGD and SGDM approximate each other when the gradient noise is the primary source of instability. We use batch size $B=512$ with two learning rate decays by a factor of $0.1$ at epochs $80$ and $120$. We grid search to find the best learning rate for SGDM ($\eta=0.2$) and then use it to run SGD and SGDM with SVAG. We use $\beta=0.9$ for SGDM. Additional experimental details are in the appendix.
  • Figure 3: SGD and SGDM trajectories when fine-tuning RoBERTa-large on five downstream tasks. We hold the effective learning rate fixed across the two optimizers, so SGDM runs use learning rate 0.001 and SGD runs use learning rate 0.01. We fix the data seed and fine-tune with five different optimization seeds. The results show that, on average, SGD and SGDM track each other closely over the course of language model fine-tuning.
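
The learning-rate pairing in the Figure 3 caption (0.001 for SGDM, 0.01 for SGD) is consistent with the common convention that momentum of the form $m \leftarrow \beta m + g$ scales the long-run step by $1/(1-\beta)$. The sketch below assumes that convention and $\beta = 0.9$. The caption does not state $\beta$ for this experiment, so that value is an assumption, as are the function names.

```python
def effective_lr(lr: float, beta: float) -> float:
    """Effective learning rate of SGDM with update m <- beta * m + g (assumed form)."""
    return lr / (1.0 - beta)


def sgdm_steps_constant_grad(lr: float, beta: float, grad: float, num_steps: int):
    """Per-step displacement of SGDM when every gradient equals `grad`.

    With a constant gradient the momentum buffer converges to grad / (1 - beta),
    so the step size converges to effective_lr(lr, beta) * grad.
    """
    m = 0.0
    steps = []
    for _ in range(num_steps):
        m = beta * m + grad   # accumulate momentum (unnormalized form)
        steps.append(lr * m)  # magnitude of the parameter update
    return steps


if __name__ == "__main__":
    # Assumed beta = 0.9: SGDM at lr 0.001 is paired with SGD at lr 0.01.
    print(effective_lr(0.001, 0.9))
    print(sgdm_steps_constant_grad(0.001, 0.9, 1.0, 500)[-1])
```

Under these assumptions, both printed values should agree with the SGD learning rate of 0.01 up to floating-point error.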

Theorems & Definitions (86)

  • Definition 2.1: NGOS (Malladi et al., 2022)
  • Definition 2.2: Vanilla SGD
  • Definition 2.3: SGD with Momentum/SGDM
  • Proposition 2.4: Descent Lemma for SGD
  • Definition 3.1: Order-$\gamma$ Weak Approximation
  • Definition 3.2
  • Theorem 3.5: Weak Approximation of SGDM by SGD
  • Theorem 4.5
  • Definition 5.1: SVAG
  • Definition B.1: Standard formulation of SGD with momentum
  • ...and 76 more
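
Definition 5.1 (SVAG) is listed above without its statement. As a rough sketch, SVAG as introduced in earlier work on SDE approximations of SGD combines two independent stochastic gradients with weights $(1 \pm \sqrt{2\ell - 1})/2$ and divides the learning rate by $\ell$. The weights sum to one, so the mean gradient is unchanged, while their squares sum to $\ell$, so the noise covariance is scaled by $\ell$. Whether Definition 5.1 uses exactly this form is an assumption here, and the helper names are hypothetical.

```python
import math
from typing import List, Tuple


def svag_weights(ell: int) -> Tuple[float, float]:
    """Weights (r1, r2) with r1 + r2 = 1 and r1^2 + r2^2 = ell (assumed SVAG form)."""
    root = math.sqrt(2 * ell - 1)
    return (1 + root) / 2, (1 - root) / 2


def svag_gradient(g1: List[float], g2: List[float], ell: int) -> List[float]:
    """Combine two independent stochastic gradients into one SVAG gradient."""
    r1, r2 = svag_weights(ell)
    return [r1 * a + r2 * b for a, b in zip(g1, g2)]


def svag_lr(lr: float, ell: int) -> float:
    """SVAG runs with the learning rate divided by ell."""
    return lr / ell


if __name__ == "__main__":
    for ell in (1, 2, 4, 8):
        r1, r2 = svag_weights(ell)
        print(ell, r1 + r2, r1 * r1 + r2 * r2)
```

With $\ell = 1$ the weights reduce to $(1, 0)$, which recovers the ordinary single-batch stochastic gradient; this matches the Figure 2 caption's use of $\ell = 1$ for the standard runs.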