On the Performance Analysis of Momentum Method: A Frequency Domain Perspective
Xianliang Li, Jun Luo, Zhiwei Zheng, Hanxiao Wang, Li Luo, Lingkun Wen, Linlong Wu, Sheng Xu
TL;DR
Motivation: there are no principled, general guidelines for choosing the momentum coefficients in SGD with momentum. Approach: a frequency-domain framework treats momentum as a multistage time-variant filter whose stage $k$ has transfer function $H_k(\omega)=\frac{v_k}{1-u_k e^{-j\omega}}$ and magnitude response $|H_k(\omega)|=\frac{|v_k|}{\sqrt{1-2u_k\cos\omega+u_k^2}}$, giving a unified view of decoupled and coupled momentum. Key findings: orthodox EMA-SGDM is inherently attenuating, while decoupled Standard-SGDM can both attenuate and amplify frequency bands; high-frequency gradient components are undesirable late in training, while preserving the gradient early and gradually amplifying low-frequency content boosts performance. Impact: the FSGDM optimizer, grounded in this framework, consistently outperforms conventional momentum methods across vision, NLP, and reinforcement learning tasks, and the framework offers practical guidance for designing frequency-aware optimizers and for extensions to Adam-family methods.
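The magnitude response above can be evaluated numerically to see the attenuate-versus-amplify distinction. The following is a minimal sketch, not the authors' code: the function name `magnitude_response`, the value $u=0.9$, the coupling rule $v=1-u$ for the EMA-style case, and the choice $v=1$ for the decoupled case are all assumptions made for illustration.

```python
import numpy as np

def magnitude_response(u, v, omega):
    """Single-stage gain |H(omega)| = |v| / sqrt(1 - 2*u*cos(omega) + u^2)."""
    omega = np.asarray(omega, dtype=float)
    return np.abs(v) / np.sqrt(1.0 - 2.0 * u * np.cos(omega) + u ** 2)

# Normalized frequencies from 0 (slowest variation) to pi (fastest variation).
omega = np.linspace(0.0, np.pi, 5)
u = 0.9  # illustrative momentum coefficient (assumption)

# Coupled, EMA-style choice: v = 1 - u (assumed coupling rule).
ema = magnitude_response(u, 1.0 - u, omega)

# Decoupled choice with v = 1 (hypothetical value chosen for illustration).
std = magnitude_response(u, 1.0, omega)

print("omega:", np.round(omega, 3))
print("EMA  :", np.round(ema, 3))
print("v = 1:", np.round(std, 3))
```

Under these assumptions the coupled gain never exceeds 1, whereas the decoupled gain is $1/(1-u)$ at $\omega=0$ and $1/(1+u)$ at $\omega=\pi$, so low frequencies are amplified and high frequencies are attenuated.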
Abstract
Momentum-based optimizers are widely adopted for training neural networks. However, the optimal selection of momentum coefficients remains elusive. This uncertainty impedes a clear understanding of the role of momentum in stochastic gradient methods. In this paper, we present a frequency domain analysis framework that interprets the momentum method as a time-variant filter for gradients, where adjustments to momentum coefficients modify the filter characteristics. Our experiments support this perspective and provide a deeper understanding of the mechanism involved. Moreover, our analysis reveals the following significant findings: high-frequency gradient components are undesirable in the late stages of training, whereas preserving the original gradient in the early stages and gradually amplifying low-frequency gradient components during training both enhance performance. Based on these insights, we propose Frequency Stochastic Gradient Descent with Momentum (FSGDM), a heuristic optimizer that dynamically adjusts the momentum filtering characteristic with an empirically effective dynamic magnitude response. Experimental results demonstrate the superiority of FSGDM over conventional momentum optimizers.
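To make the filter interpretation concrete, the sketch below applies a first-order recurrence $m_t = u_t m_{t-1} + v_t g_t$ to synthetic one-dimensional "gradient" signals. This recurrence is the standard time-domain counterpart of a transfer function of the form $\frac{v}{1-u e^{-j\omega}}$; that the optimizer's update takes exactly this form is an assumption here, and the helper `momentum_filter`, the test frequencies, and the coefficients are hypothetical choices for illustration rather than anything taken from the paper.

```python
import numpy as np

def momentum_filter(grads, u, v):
    """Apply m_t = u_t * m_{t-1} + v_t * g_t (with m_0 = 0) to a 1-D sequence.

    u and v may be scalars (time-invariant filter) or arrays with one entry
    per step (time-variant filter).
    """
    grads = np.asarray(grads, dtype=float)
    u = np.broadcast_to(u, grads.shape)
    v = np.broadcast_to(v, grads.shape)
    m = np.zeros_like(grads)
    prev = 0.0
    for t in range(len(grads)):
        prev = u[t] * prev + v[t] * grads[t]
        m[t] = prev
    return m

T = 2000
t = np.arange(T)
low = np.cos(0.05 * t)   # slowly varying synthetic gradient component
high = np.cos(3.0 * t)   # rapidly varying synthetic gradient component

u, v = 0.9, 0.1          # illustrative coupled coefficients (assumption)

# Steady-state amplitude, measured after the initial transient has decayed.
amp_low = np.abs(momentum_filter(low, u, v)[T // 2:]).max()
amp_high = np.abs(momentum_filter(high, u, v)[T // 2:]).max()

def gain(u, v, w):
    """Predicted gain |v| / sqrt(1 - 2*u*cos(w) + u^2) at frequency w."""
    return abs(v) / np.sqrt(1.0 - 2.0 * u * np.cos(w) + u ** 2)

print("low :", amp_low, "predicted", gain(u, v, 0.05))
print("high:", amp_high, "predicted", gain(u, v, 3.0))
```

With constant coefficients the measured amplitudes should track the predicted gains closely, and the high-frequency component is suppressed far more strongly than the low-frequency one. Passing per-step arrays for `u` and `v` turns the same loop into a time-variant filter.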
