Dynamic Momentum Recalibration in Online Gradient Learning

Zhipeng Yao; Rui Yu; Guisong Chang; Ying Li; Yu Zhang; Dazhou Li

TL;DR

This work reinterprets gradient updates through the lens of signal processing and reveals that fixed momentum coefficients inherently distort the balance between bias and variance, leading to skewed or suboptimal parameter updates.

Abstract

Stochastic Gradient Descent (SGD) and its momentum variants form the backbone of deep learning optimization, yet the underlying dynamics of their gradient behavior remain insufficiently understood. In this work, we reinterpret gradient updates through the lens of signal processing and reveal that fixed momentum coefficients inherently distort the balance between bias and variance, leading to skewed or suboptimal parameter updates. To address this, we propose SGDF (SGD with Filter), an optimizer inspired by the principles of Optimal Linear Filtering. SGDF computes an online, time-varying gain that refines the gradient estimate by minimizing its mean-squared error, thereby achieving an optimal trade-off between noise suppression and signal preservation. Furthermore, our approach can be extended to other optimizers, showcasing its broad applicability to optimization frameworks. Extensive experiments across diverse architectures and benchmarks demonstrate that SGDF surpasses conventional momentum methods and achieves performance on par with or surpassing state-of-the-art optimizers.
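To make the idea of a time-varying, error-minimizing gain concrete, here is a minimal sketch of textbook minimum-MSE linear fusion of two independent, unbiased estimates. It is not the SGDF algorithm from the paper: the function `fuse`, the constant true gradient, and the known noise variance are all illustrative assumptions.

```python
import math
import random


def fuse(prior, prior_var, obs, obs_var):
    """Minimum-MSE linear fusion of two independent, unbiased estimates.

    Hypothetical helper for illustration; not taken from the paper.
    """
    gain = prior_var / (prior_var + obs_var)   # lies in [0, 1]
    fused = prior + gain * (obs - prior)       # move toward the observation
    fused_var = (1.0 - gain) * prior_var       # never larger than prior_var
    return fused, fused_var, gain


# Assumed toy setting: a constant true gradient observed through noise
# whose variance is known. Real training offers neither assumption.
rng = random.Random(0)
true_grad, noise_var = 1.0, 4.0
est, est_var = 0.0, 1e6                        # deliberately vague start
gains = []
for _ in range(50):
    g = true_grad + rng.gauss(0.0, math.sqrt(noise_var))
    est, est_var, k = fuse(est, est_var, g, noise_var)
    gains.append(k)
```

In this toy setting the gain starts near 1, so the first noisy gradient is trusted almost fully, and then shrinks as the estimate becomes more certain. A fixed momentum coefficient corresponds to holding that gain constant regardless of how reliable the running estimate is.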

Paper Structure (42 sections, 15 theorems, 155 equations, 13 figures, 17 tables, 1 algorithm)

Introduction
The Gradient Estimation Dilemma
Bias and Variance
Method
SGDF General Introduction
Fusion of Gaussian Distributions
Convex and Non-convex Convergence Analysis
Experiments
Empirical Evaluation
Extensibility of Filter-Estimated Gradients
Top Eigenvalues of Hessian and Hessian Trace
Related Works
Discussion and Future Work
Conclusion
Bias-Variance Decomposition (Section 2 in main paper)
...and 27 more sections

Key Result

Lemma 2.2

For any gradient estimator $\hat{g}_t = \mathcal{A}(g_1,...,g_t)$, the mean squared error of the estimate decomposes as:
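The textbook form of such a decomposition is MSE = squared bias + variance. The sketch below checks that identity numerically for a fixed-momentum exponential moving average. The setup is an assumption made for illustration (a linearly drifting true gradient, Gaussian noise, and the hypothetical helpers `ema_estimate` and `decompose`), not the paper's statement or notation.

```python
import random
from statistics import fmean


def ema_estimate(grads, beta):
    """Fixed-momentum exponential moving average of a gradient sequence."""
    m = 0.0
    for g in grads:
        m = beta * m + (1.0 - beta) * g
    return m


def decompose(beta, steps=30, trials=2000, noise_std=1.0, seed=0):
    """Monte Carlo estimate of (MSE, squared bias, variance) at the last step."""
    rng = random.Random(seed)
    # Assumed ground truth: a true gradient that drifts linearly over time.
    truth = [0.1 * t for t in range(1, steps + 1)]
    finals = []
    for _ in range(trials):
        noisy = [g + rng.gauss(0.0, noise_std) for g in truth]
        finals.append(ema_estimate(noisy, beta))
    target = truth[-1]
    mean_est = fmean(finals)
    mse = fmean([(x - target) ** 2 for x in finals])
    bias_sq = (mean_est - target) ** 2
    variance = fmean([(x - mean_est) ** 2 for x in finals])
    return mse, bias_sq, variance


low = decompose(beta=0.5)    # light smoothing
high = decompose(beta=0.9)   # heavy smoothing
for name, (mse, bias_sq, variance) in (("beta=0.5", low), ("beta=0.9", high)):
    print(name, "mse:", mse, "bias^2 + variance:", bias_sq + variance)
```

Under these assumptions, the larger coefficient lowers the variance but lags the drifting gradient and so raises the squared bias, which is the trade-off a fixed momentum coefficient cannot adjust over time.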

Figures (13)

Figure 1: Test accuracy ([$\mu \pm \sigma$]) on CIFAR.
Figure 2: Convergence comparison between Sign SGDF and Adam.
Figure 3: Histogram of Top 50 Hessian Eigenvalues. Lower values indicate better performance on the test dataset.
Figure 4: Training (top row) and test (bottom row) accuracy of CNNs on the CIFAR-10 dataset. We report the confidence interval ([$\mu \pm \sigma$]) over 3 independent runs.
Figure 5: Training (top row) and test (bottom row) accuracy of CNNs on the CIFAR-100 dataset. We report the confidence interval ([$\mu \pm \sigma$]) over 3 independent runs.
...and 8 more figures

Theorems & Definitions (32)

Definition 2.1
Lemma 2.2
Theorem 2.3
Theorem 3.1: Convergence in Convex Optimization
Theorem 3.2
Definition A.1
Lemma A.3: Bias-Variance Decomposition
proof
Lemma A.4
proof
...and 22 more
