Signal Processing Meets SGD: From Momentum to Filter
Zhipeng Yao, Rui Yu, Guisong Chang, Ying Li, Yu Zhang, Dazhou Li
TL;DR
The paper addresses a limitation of momentum-based SGD: it cannot flexibly balance bias and variance in its gradient estimates during deep learning training. It introduces SGDF, a Wiener-filter-inspired gradient estimator that computes a time-varying gain $K_t$ and uses it to fuse the current gradient $g_t$ with the historical estimate $\widehat{m}_t$ via Gaussian fusion, giving the refined first-order estimate $\widehat{g}_t = \widehat{m}_t + K_t (g_t - \widehat{m}_t)$. The authors provide convergence analyses for both the convex setting ($R(T) = O(\sqrt{T})$ with $\alpha_t = \alpha/\sqrt{t}$) and the non-convex setting ($\mathbb{E}(T) = O((\log T)/\sqrt{T})$). Empirically, SGDF converges faster and generalizes better across CNNs, Vision Transformers, and object detection, often outperforming traditional momentum methods and performing competitively with state-of-the-art optimizers. The authors also show that SGDF can be extended to adaptive optimizers (e.g., Adam) to improve generalization, as evidenced by reduced Hessian eigenvalues and improved loss landscapes. Overall, SGDF offers a principled, time-varying approach to gradient estimation that reduces noise without sacrificing signal, with practical implications for robust and efficient training across diverse deep-learning tasks.
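As a rough illustration of the fusion step $\widehat{g}_t = \widehat{m}_t + K_t (g_t - \widehat{m}_t)$, the sketch below fuses two scalar estimates using the standard Gaussian-fusion gain. It is a minimal sketch, not the authors' implementation: the function name `fuse_gradient`, the variance arguments `var_m` and `var_g`, and the stabilizer `eps` are all hypothetical, and how SGDF actually estimates these variances is not specified in this summary.

```python
def fuse_gradient(m_hat, g, var_m, var_g, eps=1e-12):
    """Fuse a historical estimate m_hat with the current gradient g.

    Assumption (not from the source): var_m and var_g are the error
    variances of m_hat and g; the paper's own estimators may differ.
    """
    # Gaussian-fusion gain: trust g more when m_hat is the noisier estimate.
    gain = var_m / (var_m + var_g + eps)
    # Same form as the update in the text: m_hat + K * (g - m_hat).
    g_hat = m_hat + gain * (g - m_hat)
    return g_hat, gain


# The current gradient is three times noisier than the history,
# so the fused estimate stays closer to m_hat.
g_hat, gain = fuse_gradient(m_hat=-1.0, g=1.0, var_m=1.0, var_g=3.0)
print(f"gain={gain:.2f}, fused={g_hat:.2f}")
```

With a fixed gain this reduces to an exponential-moving-average style blend; the point of a time-varying $K_t$ is that the weighting changes as the relative noise levels change.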
Abstract
In deep learning, stochastic gradient descent (SGD) and its momentum-based variants are widely used for optimization. However, the internal dynamics of these methods remain underexplored. In this paper, we analyze gradient behavior through a signal processing lens, isolating key factors that influence gradient updates and revealing a critical limitation: momentum techniques lack the flexibility to adequately balance the bias and variance components of gradients, resulting in inaccurate gradient estimates. To address this issue, we introduce SGDF (SGD with Filter), a novel method based on Wiener filter principles, which derives an optimal time-varying gain to refine gradient updates by minimizing the mean square error in gradient estimation. This method yields an optimal first-order gradient estimate, effectively balancing noise reduction and signal preservation. Furthermore, our approach can be extended to adaptive optimizers, enhancing their generalization potential. Empirical results show that SGDF achieves superior convergence and generalization compared to traditional momentum methods, and performs competitively with state-of-the-art optimizers.
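To make "an optimal gain that minimizes the mean square error" concrete, here is a short worked derivation for the simplest case. It is a sketch under assumptions that are not taken from the paper: $\widehat{m}_t$ and $g_t$ are treated as unbiased estimates of the same true gradient with uncorrelated errors, and $\sigma_m^2$ and $\sigma_g^2$ are hypothetical symbols for their error variances. For a linear combination $\widehat{g}_t = \widehat{m}_t + K (g_t - \widehat{m}_t) = (1-K)\,\widehat{m}_t + K g_t$, the mean square error is

$$\mathrm{MSE}(K) = (1-K)^2 \sigma_m^2 + K^2 \sigma_g^2 .$$

Setting the derivative to zero,

$$\frac{d\,\mathrm{MSE}}{dK} = -2(1-K)\,\sigma_m^2 + 2K\,\sigma_g^2 = 0 \quad\Longrightarrow\quad K^{*} = \frac{\sigma_m^2}{\sigma_m^2 + \sigma_g^2},$$

and the resulting error is

$$\mathrm{MSE}(K^{*}) = \frac{\sigma_m^2\,\sigma_g^2}{\sigma_m^2 + \sigma_g^2} \le \min(\sigma_m^2, \sigma_g^2).$$

Under these assumptions the fused estimate is never worse than either input alone, and $K^{*}$ moves toward 1 when the historical estimate is the noisier of the two and toward 0 when the current gradient is.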
