Revisiting the Initial Steps in Adaptive Gradient Descent Optimization

Abulikemu Abuduweili, Changliu Liu

TL;DR

Adaptive gradient optimization, especially Adam, can be unstable in early training due to sign-descent caused by $v_0=0$. The authors propose two non-zero initializations, data-driven $v_{0,data}$ and random $v_{0,rnd}$, to mitigate drift in the second-moment estimate and stabilize updates, supported by theoretical drift analysis and extensive experiments across CNNs, LSTMs, Transformers, and GANs. They show that non-zero initialization yields more stable convergence, flatter loss landscapes, and improved final performance, sometimes rivaling or surpassing newer adaptive optimizers, while reducing or eliminating the need for warmup in some settings. The method is simple to implement, computationally cheap, and broadly applicable to related optimizers, with code provided for reproducibility.

Abstract

Adaptive gradient optimization methods, such as Adam, are prevalent in training deep neural networks across diverse machine learning tasks due to their ability to achieve faster convergence. However, these methods often suffer from suboptimal generalization compared to stochastic gradient descent (SGD) and exhibit instability, particularly when training Transformer models. In this work, we identify the standard initialization of the second-order moment estimation ($v_0 = 0$) as a significant factor contributing to these limitations. We introduce simple yet effective solutions: initializing the second-order moment estimation with non-zero values, using either data-driven or random initialization strategies. Empirical evaluations demonstrate that our approach not only stabilizes convergence but also enhances the final performance of adaptive gradient optimizers. Furthermore, by adopting the proposed initialization strategies, Adam achieves performance comparable to many recently proposed variants of adaptive gradient optimization methods. Our code is available at https://github.com/Walleclipse/Adam_Initialization.
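
As a rough illustration of why $v_0 = 0$ matters, the sketch below computes the first update of textbook Adam (with bias correction) in NumPy. With a zero second moment, the first step reduces to approximately `lr * sign(grad)`, so every coordinate moves by the same magnitude regardless of gradient scale. The helper `adam_first_step`, the example gradient, and the constant non-zero value `1e-2` are illustrative assumptions; the constant is only a stand-in and is not the paper's $v_{0,data}$ or $v_{0,rnd}$ construction. Keeping the standard bias correction with a non-zero $v_0$ is likewise an assumption of this sketch.

```python
import numpy as np

def adam_first_step(grad, v0, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the first Adam update (the amount subtracted from the parameters).

    Hypothetical helper for illustration: textbook Adam with m_0 = 0 and
    an arbitrary initial second moment v0.
    """
    m1 = (1 - beta1) * grad                   # first moment after one step (m_0 = 0)
    v1 = beta2 * v0 + (1 - beta2) * grad**2   # second moment after one step
    m_hat = m1 / (1 - beta1)                  # standard bias correction at t = 1
    v_hat = v1 / (1 - beta2)
    return lr * m_hat / (np.sqrt(v_hat) + eps)

# Gradients spanning several orders of magnitude (made-up values).
grad = np.array([1e-4, -0.5, 20.0])

# Standard initialization: v_0 = 0 gives a step of about lr * sign(grad).
step_zero = adam_first_step(grad, v0=np.zeros_like(grad))

# Assumed constant non-zero initialization (a stand-in, not the paper's scheme).
step_nonzero = adam_first_step(grad, v0=np.full_like(grad, 1e-2))

print("v0 = 0    :", step_zero)
print("v0 = 1e-2 :", step_nonzero)
```

With the zero initialization, all three coordinates take a step of magnitude close to `lr`; with the non-zero value, the step size grows with the gradient magnitude and stays below `lr`.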

Paper Structure

This paper contains 23 sections, 16 equations, 7 figures, 12 tables.

Figures (7)

  • Figure 1: Training Transformers on the IWSLT’14 De-En dataset.
  • Figure 2: Histogram of update step distribution across coordinates.
  • Figure 3: Optimization of the saddle objective function with different methods.
  • Figure 4: Comparison of vanilla Adam and Adam $v_{0,rnd}$ on (a) the CIFAR-10 image classification task, (b) the Penn Treebank language modeling task, and (c) the IWSLT'14 machine translation task.
  • Figure 5: Comparison of the loss landscape around the convergent points of Transformer trained by vanilla Adam and Adam $v_{0,rnd}$.
  • ...and 2 more figures