On the Performance Analysis of Momentum Method: A Frequency Domain Perspective

Xianliang Li; Jun Luo; Zhiwei Zheng; Hanxiao Wang; Li Luo; Lingkun Wen; Linlong Wu; Sheng Xu

TL;DR

Motivation: SGD with momentum lacks principled, general guidelines for choosing the momentum coefficients. Approach: a frequency-domain framework treats momentum as a multistage time-variant filter on the gradients, where stage $k$ has transfer function $H_k(\omega)=\frac{v_k}{1-u_k e^{-j\omega}}$ and magnitude response $|H_k(\omega)|=\frac{|v_k|}{\sqrt{1-2u_k\cos\omega+u_k^2}}$, giving a unified view of decoupled and coupled momentum. Key findings: the orthodox EMA-SGDM is inherently an attenuating filter, whereas the decoupled Standard-SGDM can both attenuate and amplify frequency bands; high-frequency gradient components are undesirable late in training, while preserving the original gradient early and gradually amplifying low-frequency components during training both improve performance. Impact: the FSGDM optimizer, built on this framework, outperforms conventional momentum methods across vision, NLP, and reinforcement learning tasks, and the analysis offers practical guidance for designing frequency-aware optimizers and for extensions to Adam-family methods.
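To make the filter view concrete, the following is a minimal sketch (not the authors' code) that evaluates the stage response above for a single fixed stage. It assumes the usual correspondence between the transfer function $\frac{v}{1-u e^{-j\omega}}$ and the first-order recursion $m_k = u\,m_{k-1} + v\,g_k$, takes the coupled EMA form to mean $v = 1-u$ and the decoupled Standard form to mean $v = 1$, and uses a hypothetical coefficient $u = 0.9$; the function names are illustrative only.

```python
import cmath
import math


def transfer(u, v, omega):
    """Complex response H(omega) = v / (1 - u * exp(-j * omega))."""
    return v / (1 - u * cmath.exp(-1j * omega))


def magnitude(u, v, omega):
    """Closed form |H(omega)| = |v| / sqrt(1 - 2u cos(omega) + u^2)."""
    return abs(v) / math.sqrt(1 - 2 * u * math.cos(omega) + u * u)


def filter_gradients(grads, u, v):
    """Assumed time-domain form: m_k = u * m_{k-1} + v * g_k, with m_0 = 0."""
    m, out = 0.0, []
    for g in grads:
        m = u * m + v * g
        out.append(m)
    return out


u = 0.9  # hypothetical momentum coefficient, chosen for illustration

# Coupled (EMA) coefficients, assumed v = 1 - u: gain never exceeds 1.
ema_low = magnitude(u, 1 - u, 0.0)       # gain at omega = 0
ema_high = magnitude(u, 1 - u, math.pi)  # gain at omega = pi

# Decoupled (Standard) coefficients, assumed v = 1: low band is amplified.
std_low = magnitude(u, 1.0, 0.0)
std_high = magnitude(u, 1.0, math.pi)

# A constant gradient is a pure omega = 0 signal; its filtered value
# settles at the omega = 0 gain of the filter.
steady_state = filter_gradients([1.0] * 500, u, 1.0)[-1]

print(ema_low, ema_high, std_low, std_high, steady_state)
```

With these assumed settings, the EMA gains are 1 at $\omega=0$ and about 0.053 at $\omega=\pi$, so every band is passed or attenuated, while the Standard gains are 10 at $\omega=0$ and about 0.53 at $\omega=\pi$, so low frequencies are amplified and high frequencies attenuated. This matches the attenuate-only versus attenuate-and-amplify distinction drawn above.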

Abstract

Momentum-based optimizers are widely adopted for training neural networks. However, the optimal selection of momentum coefficients remains elusive. This uncertainty impedes a clear understanding of the role of momentum in stochastic gradient methods. In this paper, we present a frequency domain analysis framework that interprets the momentum method as a time-variant filter for gradients, where adjustments to momentum coefficients modify the filter characteristics. Our experiments support this perspective and provide a deeper understanding of the mechanism involved. Moreover, our analysis reveals the following significant findings: high-frequency gradient components are undesired in the late stages of training; preserving the original gradient in the early stages, and gradually amplifying low-frequency gradient components during training both enhance performance. Based on these insights, we propose Frequency Stochastic Gradient Descent with Momentum (FSGDM), a heuristic optimizer that dynamically adjusts the momentum filtering characteristic with an empirically effective dynamic magnitude response. Experimental results demonstrate the superiority of FSGDM over conventional momentum optimizers.
