On the Performance Analysis of Momentum Method: A Frequency Domain Perspective

Xianliang Li, Jun Luo, Zhiwei Zheng, Hanxiao Wang, Li Luo, Lingkun Wen, Linlong Wu, Sheng Xu

TL;DR

Motivation: there are no principled, general guidelines for choosing the momentum coefficients in SGD with momentum. Approach: a frequency-domain framework treats momentum as a multistage time-variant filter, where stage $k$ has transfer function $H_k(\omega)=\frac{v_k}{1-u_k e^{-j\omega}}$ and magnitude response $|H_k(\omega)|=\frac{|v_k|}{\sqrt{1-2u_k\cos\omega+u_k^2}}$, giving a unified view of decoupled and coupled momentum. Key findings: orthodox EMA-SGDM is inherently attenuating, while decoupled Standard-SGDM can both attenuate and amplify frequency bands; high-frequency gradient components are undesirable late in training, while preserving the original gradient early and gradually amplifying low-frequency components improves performance. Impact: the FSGDM optimizer, grounded in this framework, consistently outperforms conventional momentum methods across vision, NLP, and reinforcement learning tasks, and offers practical guidance for designing frequency-aware optimizers and for extensions to Adam-family methods.
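The magnitude response above can be evaluated numerically to see the attenuating and amplifying behaviour. The following is a minimal sketch, assuming NumPy; the function name `magnitude_response` is illustrative and not from the paper, and the coefficient choices ($u=0.9$ with $v=1-u$ for the coupled case and $v=1$ for the decoupled case) are taken from the Figure 1 settings.

```python
import numpy as np

def magnitude_response(u, v, omega):
    """Single-stage gain |H(omega)| = |v| / sqrt(1 - 2*u*cos(omega) + u**2)."""
    omega = np.asarray(omega, dtype=float)
    return abs(v) / np.sqrt(1.0 - 2.0 * u * np.cos(omega) + u * u)

# Evaluate at the lowest (omega = 0) and highest (omega = pi) normalized frequency.
omega = np.array([0.0, np.pi])

# Coupled (EMA-style) coefficients: v = 1 - u.
ema = magnitude_response(0.9, 0.1, omega)

# Decoupled (Standard-style) coefficients: v = 1.
standard = magnitude_response(0.9, 1.0, omega)

print("EMA gains      (omega=0, pi):", ema)
print("Standard gains (omega=0, pi):", standard)
```

With these values the coupled filter never exceeds unit gain, while the decoupled filter amplifies the low-frequency end and attenuates the high-frequency end, which matches the attenuating versus amplifying distinction drawn above.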

Abstract

Momentum-based optimizers are widely adopted for training neural networks. However, the optimal selection of momentum coefficients remains elusive. This uncertainty impedes a clear understanding of the role of momentum in stochastic gradient methods. In this paper, we present a frequency domain analysis framework that interprets the momentum method as a time-variant filter for gradients, where adjustments to momentum coefficients modify the filter characteristics. Our experiments support this perspective and provide a deeper understanding of the mechanism involved. Moreover, our analysis reveals the following significant findings: high-frequency gradient components are undesired in the late stages of training; preserving the original gradient in the early stages, and gradually amplifying low-frequency gradient components during training both enhance performance. Based on these insights, we propose Frequency Stochastic Gradient Descent with Momentum (FSGDM), a heuristic optimizer that dynamically adjusts the momentum filtering characteristic with an empirically effective dynamic magnitude response. Experimental results demonstrate the superiority of FSGDM over conventional momentum optimizers.

Paper Structure

This paper contains 33 sections, 1 theorem, 15 equations, 15 figures, 9 tables, 3 algorithms.

Key Result

Proposition 1

With the number of stages $N$ and the scaling factor $c$ fixed, the dynamic magnitude response of the FSGDM algorithm remains invariant with respect to changes in the total number of training steps.

Figures (15)

  • Figure 1: Visualization of different filters applied to a noisy sinusoidal signal. (a) $u_{k}=0\rightarrow 1, v_{k}=1-u_{k}$: the system gradually shifts from an all-pass filter to a narrow low-pass filter; (b) $u_{k}=0\rightarrow -1, v_{k}=1+u_{k}$: the system gradually shifts from an all-pass filter to a narrow high-pass filter; (c) $u_{k}=0.9, v_{k}=1$: the momentum behaves like a low-pass gain filter that amplifies low-frequency gradient components; (d) $u_{k}=-0.9, v_{k}=1$: the momentum behaves like a high-pass gain filter that amplifies high-frequency components. These panels verify the amplifying and attenuating effects of the different momentum systems.
  • Figure 2: (Top) Analysis of the (dynamic) magnitude responses in the early and late training stages for EMA-SGDM with low-pass momentum, as defined in the orthodox momentum equation. The solid lines denote the magnitude responses in the early stages, and the dashed lines denote the magnitude responses in the late stages. (Bottom) Comparison between the gradient norms and momentum norms for EMA-SGDM with low-pass momentum. Left column: increasing sequence. Middle column: fixed sequence. Right column: decreasing sequence.
  • Figure 3: (Top) Analysis of the (dynamic) magnitude responses in the early and late training stages for Standard-SGDM with low-pass gain momentum, as defined in the unorthodox momentum equation. The solid lines denote the magnitude responses in the early stages, and the dashed lines denote the magnitude responses in the late stages. (Bottom) Comparison between the gradient norms and momentum norms for Standard-SGDM with low-pass gain momentum. Left column: increasing sequence. Middle column: fixed sequence. Right column: decreasing sequence.
  • Figure 4: The Top-1 test errors of training ResNet18 on CIFAR-10, ResNet34 on Tiny-ImageNet and ResNet50 on CIFAR-100. The results show that the optimal parameter selections across these three training settings exhibit a high similarity. The black points denote the parameter selections with better test performance. The optimal zone of the parameter selection is circled in red.
  • Figure 5: The reward curves of EMA-, Standard-SGDM, and FSGDM on three MuJoCo tasks.
  • ...and 10 more figures
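
The filtering behaviour described in the Figure 1 caption can be reproduced on synthetic signals. The sketch below assumes NumPy and the time-domain recursion $m_t = u\,m_{t-1} + v\,g_t$, which is the recursion whose transfer function is $\frac{v}{1-u e^{-j\omega}}$. The helper names (`momentum_filter`, `measured_gain`, `predicted_gain`), the test frequencies, and the use of noise-free sinusoids are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

def momentum_filter(grads, u, v):
    """Apply m_t = u * m_{t-1} + v * g_t (with m_0 = 0) to a 1-D sequence."""
    m = 0.0
    out = np.empty(len(grads))
    for t, g in enumerate(grads):
        m = u * m + v * g
        out[t] = m
    return out

def predicted_gain(u, v, omega):
    """Theoretical gain |v| / sqrt(1 - 2*u*cos(omega) + u**2)."""
    return abs(v) / np.sqrt(1.0 - 2.0 * u * np.cos(omega) + u * u)

def measured_gain(u, v, omega, n=4000):
    """RMS ratio of output to input sinusoid after the transient has decayed."""
    t = np.arange(n)
    g = np.sin(omega * t)
    m = momentum_filter(g, u, v)
    tail = slice(n // 2, n)  # discard the first half as start-up transient
    return np.sqrt(np.mean(m[tail] ** 2)) / np.sqrt(np.mean(g[tail] ** 2))

# Panel (c) setting: u = 0.9, v = 1 (low-pass gain filter).
u, v = 0.9, 1.0
low_omega, high_omega = 0.05, 3.0  # illustrative low and high frequencies

low_measured = measured_gain(u, v, low_omega)
high_measured = measured_gain(u, v, high_omega)
low_predicted = predicted_gain(u, v, low_omega)
high_predicted = predicted_gain(u, v, high_omega)

print("low frequency : measured", low_measured, "predicted", low_predicted)
print("high frequency: measured", high_measured, "predicted", high_predicted)
```

Under these assumptions the low-frequency sinusoid comes out amplified and the high-frequency one attenuated, in line with the low-pass gain behaviour described for panel (c). Flipping the sign of `u` gives the high-pass counterpart of panel (d).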

Theorems & Definitions (1)

  • Proposition 1