Signal Processing Meets SGD: From Momentum to Filter

Zhipeng Yao; Rui Yu; Guisong Chang; Ying Li; Yu Zhang; Dazhou Li

TL;DR

Momentum-based SGD struggles to balance bias and variance in its gradient estimates during deep learning training. The paper introduces SGDF, a Wiener-filter-inspired gradient estimator that computes a time-varying gain $K_t$ and fuses the current gradient $g_t$ with the historical (momentum) estimate $\widehat{m}_t$ via Gaussian fusion, giving the refined first-order estimate $\widehat{g}_t = \widehat{m}_t + K_t (g_t - \widehat{m}_t)$. The authors provide convergence analyses for the convex setting (regret $R(T) = O(\sqrt{T})$ with step size $\alpha_t = \alpha/\sqrt{t}$) and the non-convex setting ($\mathbb{E}(T) = O((\log T)/\sqrt{T})$). Empirically, SGDF converges faster and generalizes better across CNNs, Vision Transformers, and object detection, often outperforming traditional momentum methods and competing with state-of-the-art optimizers. The authors also show that SGDF can extend to adaptive optimizers (e.g., Adam) to improve generalization, as evidenced by reduced Hessian eigenvalues and improved loss landscapes. Overall, SGDF offers a principled, dynamic approach to gradient estimation that reduces noise without sacrificing signal, with practical implications for robust and efficient training in diverse deep-learning tasks.
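
The fusion step in the update $\widehat{g}_t = \widehat{m}_t + K_t (g_t - \widehat{m}_t)$ can be illustrated with a minimal sketch. It treats the historical estimate and the current gradient as two Gaussian estimates of the same quantity and fuses them with the standard Gaussian-product gain. The function name `fuse_gaussian` and the arguments `var_m`, `var_g`, and `eps` are illustrative assumptions: this summary does not say how SGDF estimates the two variances, so they are simply passed in here.

```python
def fuse_gaussian(m_hat, var_m, g, var_g, eps=1e-12):
    """Fuse two Gaussian estimates of one gradient coordinate.

    m_hat, var_m: historical estimate and its (assumed known) variance.
    g, var_g:     current gradient and its (assumed known) variance.
    eps:          hypothetical guard against division by zero.
    """
    # Gain in [0, 1): close to 1 when the historical estimate is the
    # noisier of the two, close to 0 when the current gradient is.
    gain = var_m / (var_m + var_g + eps)
    # Same form as the update in the text: m_hat + K * (g - m_hat).
    fused_mean = m_hat + gain * (g - m_hat)
    # The fused estimate is never noisier than the historical one.
    fused_var = (1.0 - gain) * var_m
    return fused_mean, fused_var, gain


if __name__ == "__main__":
    # Equal variances: the gain is about 0.5, so the fused estimate
    # lands about halfway between the two inputs.
    print(fuse_gaussian(m_hat=0.0, var_m=1.0, g=2.0, var_g=1.0))
```

With a fixed momentum coefficient the weight on the new gradient is constant; in this sketch the weight changes whenever the relative variances change, which is the sense in which the gain is time-varying.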

Abstract

In deep learning, stochastic gradient descent (SGD) and its momentum-based variants are widely used for optimization. However, the internal dynamics of these methods remain underexplored. In this paper, we analyze gradient behavior through a signal processing lens, isolating key factors that influence gradient updates and revealing a critical limitation: momentum techniques lack the flexibility to adequately balance bias and variance components in gradients, resulting in gradient estimation inaccuracies. To address this issue, we introduce a novel method SGDF (SGD with Filter) based on Wiener Filter principles, which derives an optimal time-varying gain to refine gradient updates by minimizing the mean square error in gradient estimation. This method yields an optimal first-order gradient estimate, effectively balancing noise reduction and signal preservation. Furthermore, our approach could extend to adaptive optimizers, enhancing their generalization potential. Empirical results show that SGDF achieves superior convergence and generalization compared to traditional momentum methods, and performs competitively with state-of-the-art optimizers.
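
To see why minimizing the mean square error produces a gain of this kind, consider a simplified scalar case. This is a sketch under stated assumptions, not the paper's derivation: suppose a historical estimate $\widehat{m}_t$ and a current gradient $g_t$ are unbiased, independent estimates of the same true gradient, with variances $\sigma_m^2$ and $\sigma_g^2$ (these symbols are introduced here for illustration). For a combined estimate $\widehat{g}_t = \widehat{m}_t + K (g_t - \widehat{m}_t) = (1-K)\,\widehat{m}_t + K\, g_t$:

$$
\begin{aligned}
\mathrm{MSE}(K) &= (1-K)^2 \sigma_m^2 + K^2 \sigma_g^2, \\
\frac{d\,\mathrm{MSE}}{dK} &= -2(1-K)\,\sigma_m^2 + 2K\,\sigma_g^2 = 0
\;\;\Longrightarrow\;\;
K^\star = \frac{\sigma_m^2}{\sigma_m^2 + \sigma_g^2}, \\
\mathrm{MSE}(K^\star) &= \frac{\sigma_m^2\,\sigma_g^2}{\sigma_m^2 + \sigma_g^2} \le \min(\sigma_m^2, \sigma_g^2).
\end{aligned}
$$

In training, the historical estimate is built from past gradients and is generally neither unbiased nor independent of $g_t$, so this should be read only as intuition for how an error-minimizing gain trades noise reduction against signal preservation.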

Paper Structure (33 sections, 14 theorems, 139 equations, 14 figures, 12 tables, 1 algorithm)

Introduction
Related Works
The Gradient Estimation Dilemma
Bias and Variance
Visualization of Gradient Distribution
Method
SGDF General Introduction
Fusion of Gaussian Distributions
Convex and Non-convex Convergence Analysis
Experiments
Empirical Evaluation
Top Eigenvalues of Hessian and Hessian Trace
Wiener Filter combines Adam
Limitations and Future Work
Conclusion
...and 18 more sections

Key Result

Lemma 3.2

For any gradient estimator $\hat{g}_t = \mathcal{A}(g_1,...,g_t)$, the estimation mean square error decomposes as:
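
The equation body did not survive in this summary. As a hedged reconstruction, the standard bias-variance decomposition of mean square error, which is consistent with the lemma's wording and with the section title "Bias and Variance", reads as follows. The symbol $\nabla f(\theta_t)$ for the true gradient is an assumption, and the paper's exact notation may differ:

$$
\mathbb{E}\big[\|\hat{g}_t - \nabla f(\theta_t)\|^2\big]
= \underbrace{\big\|\mathbb{E}[\hat{g}_t] - \nabla f(\theta_t)\big\|^2}_{\text{squared bias}}
+ \underbrace{\mathbb{E}\big[\|\hat{g}_t - \mathbb{E}[\hat{g}_t]\|^2\big]}_{\text{variance}}
$$

The identity holds when $\nabla f(\theta_t)$ is treated as fixed: expanding the square around $\mathbb{E}[\hat{g}_t]$ leaves a cross term with zero expectation.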

Figures (14)

Figure 1: Training the VGG model on the CIFAR-100 dataset with the same initial learning rate of 0.1, multiplied by a factor of 0.1 at the 150th epoch.
Figure 2: Gradient histogram of VGG on the CIFAR-100 dataset. The x-axis is the gradient value and the height is the frequency. When SGD trains VGG without BN, the gradient variance fluctuates dramatically and the updates are unstable.
Figure 3: Test accuracy ([$\mu \pm \sigma$]) on CIFAR.
Figure 4: Histogram of Top 50 Hessian Eigenvalues. Lower values indicate better performance on the test dataset.
Figure 5: Training (top row) and test (bottom row) accuracy of CNNs on CIFAR-10 dataset. We report confidence interval ([$\mu \pm \sigma$]) of 3 independent runs.
...and 9 more figures

Theorems & Definitions (29)

Definition 3.1
Lemma 3.2
Theorem 3.3
Theorem 4.1
Theorem 4.2
Definition A.1
Lemma A.3
proof
Lemma A.4
proof
...and 19 more
