Dynamically Weighted Momentum with Adaptive Step Sizes for Efficient Deep Network Training

Zhifeng Wang, Longlong Li, Chunyan Zeng

TL;DR

The paper addresses fluctuating learning efficiency in deep network training by introducing DWMGrad, an optimizer that adjusts momentum and learning rates through a history-aware, windowed mechanism. By continuously expanding a controllable historical window, DWMGrad coordinates gradient-magnitude normalization with momentum smoothing, yielding faster convergence and higher accuracy across computer vision, NLP, and audio tasks, as well as better results on synthetic benchmarks such as the Rosenbrock function. The authors prove convergence in the convex setting using a potential-function argument and analyze computational complexity, showing $O(n\cdot d)$ time, comparable to standard optimizers. Empirically, DWMGrad outperforms or matches strong baselines on CIFAR-10/100, ImageNet, GLUE, Cora, PubMed, and UrbanSound8K, demonstrating robustness and scalability for large-scale models and diverse data types.

Abstract

In current deep learning research, widely used optimization algorithms such as Stochastic Gradient Descent (SGD) and Adaptive Moment Estimation (Adam) still struggle to handle fluctuations in learning efficiency, meet the demands of complex models, and solve non-convex optimization problems. These challenges stem primarily from the algorithms' limitations in handling complex data structures and models: for instance, difficulty in selecting an appropriate learning rate, avoiding local optima, and navigating high-dimensional spaces. To address these issues, this paper introduces a novel optimization algorithm named DWMGrad. Building on traditional methods, DWMGrad incorporates a dynamic guidance mechanism that uses historical data to update momentum and learning rates. This lets the optimizer flexibly adjust its reliance on historical information and adapt to changing environments and task complexities across training scenarios. Extensive experiments show that DWMGrad achieves faster convergence and higher accuracy in a wide range of scenarios.
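
The abstract describes the mechanism only at a high level and does not give the update rule, so the following is an illustrative sketch of the general idea rather than the authors' DWMGrad algorithm. It assumes a simple stand-in rule: momentum is an exponential moving average of gradients, and the step size is normalized by the mean squared gradient over a history window that grows up to a cap. The names `windowed_momentum_descent`, `max_window`, `beta`, and all default values are hypothetical choices made for this example. It is run on the Rosenbrock function mentioned in the TL;DR.

```python
from collections import deque

import numpy as np


def rosenbrock(p, a=1.0, b=100.0):
    """Rosenbrock function; its global minimum is 0 at (a, a**2)."""
    x, y = p
    return (a - x) ** 2 + b * (y - x ** 2) ** 2


def rosenbrock_grad(p, a=1.0, b=100.0):
    """Analytic gradient of the Rosenbrock function."""
    x, y = p
    dx = -2.0 * (a - x) - 4.0 * b * x * (y - x ** 2)
    dy = 2.0 * b * (y - x ** 2)
    return np.array([dx, dy])


def windowed_momentum_descent(grad_fn, p0, lr=1e-2, beta=0.9,
                              max_window=100, eps=1e-8, steps=3000):
    """Hypothetical windowed optimizer (NOT the paper's update rule).

    Momentum: exponential moving average of gradients.
    Step size: scaled by the RMS gradient over the last `max_window` steps;
    the window grows from 1 entry until it reaches that cap.
    """
    p = np.asarray(p0, dtype=float).copy()
    m = np.zeros_like(p)        # momentum buffer
    sq_sum = np.zeros_like(p)   # running sum of squared gradients in window
    history = deque()           # squared gradients currently in the window
    for _ in range(steps):
        g = grad_fn(p)
        if len(history) == max_window:
            sq_sum -= history.popleft()   # drop the oldest entry
        history.append(g * g)
        sq_sum += history[-1]
        # Clamp at zero to guard against floating-point drift in the sum.
        v = np.maximum(sq_sum / len(history), 0.0)
        m = beta * m + (1.0 - beta) * g
        p -= lr * m / (np.sqrt(v) + eps)
    return p


start = np.array([-1.5, 2.0])
end = windowed_momentum_descent(rosenbrock_grad, start)
print("loss at start:", rosenbrock(start))
print("loss at end:  ", rosenbrock(end))
```

In this sketch, keeping a running sum means each step costs time proportional to the number of parameters, at the price of storing up to `max_window` squared-gradient vectors. Whether DWMGrad makes the same trade-off is not stated in the text above.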

Paper Structure

This paper contains 25 sections, 40 equations, 13 figures, 9 tables, and 1 algorithm.

Figures (13)

  • Figure 1: Diagram of the DWMGrad optimization framework.
  • Figure 2: Comparison of training loss and test accuracy on the CIFAR-10 dataset for the ResNet-110 and EfficientNet-b0 benchmark models. (a) EfficientNet training process on the CIFAR-10 dataset. (b) EfficientNet testing process on the CIFAR-10 dataset. (c) ResNet training process on the CIFAR-10 dataset. (d) ResNet testing process on the CIFAR-10 dataset.
  • Figure 3: Performance comparison on CIFAR-10. (a) EfficientNet-b0 model performance evaluation results for different optimizers. (b) ResNet-110 model performance evaluation results for different optimizers.
  • Figure 4: Comparison of training loss and test accuracy on the CIFAR-100 dataset for the ResNet-110 and EfficientNet-b0 benchmark models. (a) EfficientNet training process on the CIFAR-100 dataset. (b) EfficientNet testing process on the CIFAR-100 dataset. (c) ResNet training process on the CIFAR-100 dataset. (d) ResNet testing process on the CIFAR-100 dataset.
  • Figure 5: Performance comparison on CIFAR-100. (a) EfficientNet-b0 model performance evaluation results for different optimizers. (b) ResNet-110 model performance evaluation results for different optimizers.
  • ...and 8 more figures