Dynamically Weighted Momentum with Adaptive Step Sizes for Efficient Deep Network Training
Zhifeng Wang, Longlong Li, Chunyan Zeng
TL;DR
The paper tackles the challenge of fluctuating learning efficiency in deep network training by introducing DWMGrad, a dynamic optimizer that adjusts momentum and learning rates through a history-aware, windowed mechanism. By continuously expanding a controllable historical window, DWMGrad coordinates gradient magnitude normalization with momentum smoothing, yielding faster convergence and higher accuracy across computer vision, NLP, and audio tasks, as well as on synthetic benchmarks like the Rosenbrock function. The authors provide theoretical convergence proofs under convexity via a potential function and analyze computational complexity, showing $O(n\cdot d)$ time, comparable to standard optimizers. Empirically, DWMGrad outperforms or matches strong baselines on CIFAR-10/100, ImageNet, GLUE, Cora, PubMed, and UrbanSound8K, demonstrating robustness and scalability for large-scale models and diverse data types.
Abstract
Although optimization algorithms such as Stochastic Gradient Descent (SGD) and Adaptive Moment Estimation (Adam) are widely used in deep learning, they remain inadequate at handling fluctuations in learning efficiency, meeting the demands of complex models, and tackling non-convex optimization problems. These shortcomings stem mainly from the algorithms' limitations in handling complex data structures and models: for instance, difficulty in selecting an appropriate learning rate, avoiding local optima, and navigating high-dimensional spaces. To address these issues, this paper introduces a novel optimization algorithm named DWMGrad. Building on traditional methods, it incorporates a dynamic guidance mechanism that uses historical data to update momentum and learning rates during training. This lets the optimizer flexibly adjust its reliance on historical information and adapt to changing training scenarios and task complexities. Extensive experiments show that DWMGrad achieves faster convergence and higher accuracy across a wide range of scenarios.
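To make the idea of a history-dependent update concrete, the following is a minimal sketch, not the DWMGrad update rule itself, which is not specified in this section. It assumes a simple design in which a bounded window of recent gradients supplies both the smoothed direction (the window mean) and the per-coordinate normalization (the window root mean square). The class name `WindowedMomentumSketch`, its parameters `lr`, `max_window`, and `eps`, and the quadratic test problem are all hypothetical choices made for illustration.

```python
from collections import deque

import numpy as np


class WindowedMomentumSketch:
    """Illustrative windowed optimizer; NOT the paper's DWMGrad update rule."""

    def __init__(self, lr=0.02, max_window=10, eps=1e-8):
        self.lr = lr
        self.eps = eps
        # The window grows by one gradient per step until it holds
        # max_window entries; after that the oldest gradient is dropped.
        self.history = deque(maxlen=max_window)

    def step(self, params, grad):
        self.history.append(np.asarray(grad, dtype=float))
        window = np.stack(self.history)  # shape: (window_len, d)
        # Assumed momentum smoothing: average of the gradients in the window.
        momentum = window.mean(axis=0)
        # Assumed magnitude normalization: per-coordinate RMS over the window.
        scale = np.sqrt((window ** 2).mean(axis=0)) + self.eps
        return params - self.lr * momentum / scale


def quadratic_loss(x, curvature):
    """Convex test objective 0.5 * sum(curvature * x**2)."""
    return 0.5 * float(np.sum(curvature * x ** 2))


def run_demo(steps=500):
    """Run the sketch on an ill-conditioned quadratic; return (initial, final) loss."""
    curvature = np.array([1.0, 10.0])
    x = np.array([3.0, -2.0])
    opt = WindowedMomentumSketch()
    initial = quadratic_loss(x, curvature)
    for _ in range(steps):
        grad = curvature * x  # exact gradient of the quadratic
        x = opt.step(x, grad)
    return initial, quadratic_loss(x, curvature)
```

Calling `run_demo()` returns the loss before and after training on the toy quadratic. Because the window mean never exceeds the window RMS in magnitude, each coordinate moves by at most `lr` per step in this sketch, so the window length mainly controls how quickly the update direction reacts to a change in gradient sign.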
