Conjugate-Gradient-like Based Adaptive Moment Estimation Optimization Algorithm for Deep Learning
Jiawu Tian, Liwei Xu, Xiaowei Zhang, Yongqi Li
TL;DR
The paper addresses convergence difficulties of Adam-type optimizers on non-convex deep-learning objectives. It proposes CG-like-Adam, whose update direction is conjugate-gradient-like: $d_t := g_t - \frac{\gamma_t}{t^a} d_{t-1}$ with $a\ge \tfrac{1}{2}$. Both the first- and second-order moments are computed in CG-like form, with a bias-corrected $\hat m_t$ and a bounded $\hat v_t$. The authors prove convergence under standard assumptions, covering the cases of a constant $\beta_1$ and of unbiased first-order moment estimation, with a rate characterized by $\min_t \mathbb{E}\|\nabla f(x_t)\|^2 = O(S_1(T)/S_2(T))$; a corollary gives an $O(\log T/\sqrt{T})$ rate for $\alpha_t=\alpha/t^b$, $b\in[1/2,1)$. Experiments on CIFAR-10/100 with VGG-19 and ResNet-34 show faster convergence, lower training loss, and improved generalization relative to Adam and CoBA, supporting the method's practical value for deep-learning optimization.
Abstract
Training deep neural networks is a challenging task. To speed up training and enhance the performance of deep neural networks, we modify the vanilla conjugate gradient into a conjugate-gradient-like direction and incorporate it into generic Adam, yielding a new optimization algorithm for deep learning named CG-like-Adam. Specifically, both the first-order and the second-order moment estimations of generic Adam are replaced by their conjugate-gradient-like counterparts. The convergence analysis covers the cases where the exponential moving average coefficient of the first-order moment estimation is constant and where the first-order moment estimation is unbiased. Numerical experiments on the CIFAR-10/100 datasets show the superiority of the proposed algorithm.
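The update described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the direction $d_t = g_t - \frac{\gamma_t}{t^a} d_{t-1}$ and the step size $\alpha_t = \alpha/t^b$ follow the summary, while the Fletcher-Reeves-style choice of $\gamma_t$, the AMSGrad-style running maximum used to keep $\hat v_t$ bounded, the default hyperparameters, and the names `init_state` and `cg_like_adam_step` are assumptions introduced here for illustration.

```python
import math

def init_state(n):
    """Zero-initialised optimizer state for an n-dimensional parameter vector."""
    return {"d": [0.0] * n, "m": [0.0] * n, "v": [0.0] * n,
            "v_hat": [0.0] * n, "g_sq": 0.0}

def cg_like_adam_step(x, g, state, t, alpha=1e-3, a=0.5, b=0.5,
                      beta1=0.9, beta2=0.999, eps=1e-8):
    """One CG-like-Adam-style update at iteration t >= 1 (illustrative sketch)."""
    g_sq = sum(gi * gi for gi in g)
    # ASSUMPTION: Fletcher-Reeves-style coefficient ||g_t||^2 / ||g_{t-1}||^2,
    # set to zero when no previous gradient is available.
    gamma = g_sq / state["g_sq"] if state["g_sq"] > 0.0 else 0.0
    # Conjugate-gradient-like direction: d_t = g_t - (gamma_t / t^a) * d_{t-1}.
    scale = gamma / t ** a
    d = [gi - scale * di for gi, di in zip(g, state["d"])]
    # Both moment estimates are built from d_t instead of the raw gradient.
    m = [beta1 * mi + (1.0 - beta1) * di for mi, di in zip(state["m"], d)]
    v = [beta2 * vi + (1.0 - beta2) * di * di for vi, di in zip(state["v"], d)]
    # ASSUMPTION: running maximum (as in AMSGrad) to keep v_hat bounded below.
    v_hat = [max(hi, vi) for hi, vi in zip(state["v_hat"], v)]
    bias = 1.0 - beta1 ** t   # bias correction for the first moment
    step = alpha / t ** b     # decaying step size alpha_t = alpha / t^b
    x_new = [xi - step * (mi / bias) / (math.sqrt(hi) + eps)
             for xi, mi, hi in zip(x, m, v_hat)]
    return x_new, {"d": d, "m": m, "v": v, "v_hat": v_hat, "g_sq": g_sq}

# Toy usage: f(x) = 0.5 * ||x||^2, whose gradient is x itself.
x, state = [3.0, -2.0], init_state(2)
for t in range(1, 101):
    x, state = cg_like_adam_step(x, list(x), state, t)
```

On the first iteration the previous direction is zero, so $d_1 = g_1$ and the step reduces to a bias-corrected Adam step; the conjugate term only takes effect from $t = 2$ onward, and its weight shrinks as $1/t^a$.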
