Conjugate-Gradient-like Based Adaptive Moment Estimation Optimization Algorithm for Deep Learning

Jiawu Tian, Liwei Xu, Xiaowei Zhang, Yongqi Li

TL;DR

The paper tackles convergence challenges of Adam-type optimizers for non-convex deep-learning objectives. It proposes CG-like-Adam, whose update direction is conjugate-gradient-like: $d_t := g_t - \frac{\gamma_t}{t^a} d_{t-1}$ with $a\ge \tfrac{1}{2}$. Both the first- and second-order moment estimates are computed from this direction instead of the raw gradient, yielding a bias-corrected $\hat m_t$ and a bounded $\hat v_t$. The authors prove convergence under standard assumptions for two cases, a constant $\beta_1$ and an unbiased first-order moment estimate, with a rate characterized by $\min_t \mathbb{E}\|\nabla f(x_t)\|^2 = O(S_1(T)/S_2(T))$ and a corollary giving an $O(\log T/\sqrt{T})$ rate for $\alpha_t=\alpha/t^b$, $b\in[1/2,1)$. Experiments on CIFAR-10/100 with VGG-19 and ResNet-34 show faster convergence, lower training loss, and improved generalization relative to Adam and CoBA.

Abstract

Training deep neural networks is a challenging task. In order to speed up training and enhance the performance of deep neural networks, we rectify the vanilla conjugate gradient as conjugate-gradient-like and incorporate it into the generic Adam, and thus propose a new optimization algorithm named CG-like-Adam for deep learning. Specifically, both the first-order and the second-order moment estimation of generic Adam are replaced by the conjugate-gradient-like. Convergence analysis handles the cases where the exponential moving average coefficient of the first-order moment estimation is constant and the first-order moment estimation is unbiased. Numerical experiments show the superiority of the proposed algorithm based on the CIFAR10/100 dataset.
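
The update described in the TL;DR and abstract can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' algorithm: the Fletcher-Reeves-style choice of $\gamma_t$, the AMSGrad-style running maximum used to keep $\hat v_t$ bounded, the exponents `a` and `b`, and the function name `cg_like_adam` are all assumptions made for the example.

```python
import numpy as np

def cg_like_adam(grad_fn, x0, steps=200, alpha=0.01, a=1.0, b=0.5,
                 beta1=0.9, beta2=0.999, eps=1e-8):
    """Illustrative CG-like-Adam loop (hypothetical helper, see assumptions above)."""
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)        # first-moment estimate
    v = np.zeros_like(x)        # second-moment estimate
    v_hat = np.zeros_like(x)    # running maximum of v (assumed bounding scheme)
    d_prev = np.zeros_like(x)   # previous CG-like direction
    g_prev = None
    for t in range(1, steps + 1):
        g = grad_fn(x)
        # Assumption: Fletcher-Reeves-style coefficient; gamma_1 = 0.
        if g_prev is None:
            gamma = 0.0
        else:
            gamma = float(g @ g) / (float(g_prev @ g_prev) + eps)
        # CG-like direction: d_t = g_t - (gamma_t / t^a) * d_{t-1}
        d = g - (gamma / t**a) * d_prev
        # Both moments are driven by d_t instead of the raw gradient g_t.
        m = beta1 * m + (1.0 - beta1) * d
        v = beta2 * v + (1.0 - beta2) * d * d
        m_hat = m / (1.0 - beta1**t)      # bias-corrected first moment
        v_hat = np.maximum(v_hat, v)      # assumed AMSGrad-style maximum
        # Decaying step size alpha_t = alpha / t^b
        x = x - (alpha / t**b) * m_hat / (np.sqrt(v_hat) + eps)
        d_prev, g_prev = d, g
    return x

if __name__ == "__main__":
    # Toy objective f(x) = 0.5 * ||x||^2, whose gradient is x itself.
    start = np.array([3.0, -2.0])
    print(cg_like_adam(lambda x: x, start))
```

Setting `gamma = 0.0` on every step reduces the direction to the plain gradient, which makes the sketch a convenient way to compare the CG-like direction against an ordinary AMSGrad-style update on the same toy problem.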

Paper Structure (16 sections, 14 theorems, 105 equations, 13 figures, 2 algorithms)

This paper contains 16 sections, 14 theorems, 105 equations, 13 figures, 2 algorithms.

Key Result

Theorem 3.1

Suppose that Assumptions ass31-ass34 are satisfied and that $\beta_{1t} \in [0,1)$, $\beta_{1t} \leq \beta_{1(t+1)}$, and $\beta_{1(t+1)} \leq \beta_{1t} h(t)$ (or $\beta_{1t} h(t) \leq \beta_{1(t+1)}$) hold for all $t \in \mathcal{T}$, in which $h(t)=\frac{(1-\beta_{11}^{t-1})(1-\beta_{11}^{t+1})}{(1-\beta\ldots}$ … where $C_{1}$, $C_{2}$, $C_{3}$ and $C_{4}$ are constants independent of $T$, $\mu_{t}=\frac{\alpha\ldots}{\ldots}$ …

Figures (13)

  • Figure 1: CG-like-Adam vs. CoBA under different learning rates. (VGG-19, CIFAR-10, HS (\ref{eq4}))
  • Figure 2: CG-like-Adam vs. CoBA under different learning rates. (VGG-19, CIFAR-10, FR (\ref{eq5}))
  • Figure 3: CG-like-Adam vs. CoBA under different learning rates. (VGG-19, CIFAR-10, PRP (\ref{eq6}))
  • Figure 4: CG-like-Adam vs. CoBA under different learning rates. (VGG-19, CIFAR-10, DY (\ref{eq7}))
  • Figure 5: CG-like-Adam vs. CoBA under different learning rates. (VGG-19, CIFAR-10, HZ (\ref{eq8}))
  • ...and 8 more figures
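
The labels HS, FR, PRP, DY and HZ in the captions above presumably refer to the classical Hestenes-Stiefel, Fletcher-Reeves, Polak-Ribière-Polyak, Dai-Yuan and Hager-Zhang conjugate-gradient coefficients. The paper's own equations (eq4 to eq8) are not reproduced on this page, so the sketch below uses the textbook forms; the signs and the role of the previous direction may differ under the paper's convention $d_t := g_t - \frac{\gamma_t}{t^a} d_{t-1}$. The function name `cg_coefficient` and the `eps` safeguard are assumptions for illustration.

```python
import numpy as np

def cg_coefficient(rule, g, g_prev, d_prev, eps=1e-12):
    """Textbook CG coefficient for the given rule (hypothetical helper).

    g, g_prev: current and previous gradients; d_prev: previous direction.
    """
    y = g - g_prev                      # gradient difference y = g_t - g_{t-1}
    dy = float(d_prev @ y)
    # Guard against a (near-)zero denominator; an assumption, not from the paper.
    dy = dy if abs(dy) > eps else eps
    gg_prev = float(g_prev @ g_prev) + eps
    if rule == "FR":    # Fletcher-Reeves
        return float(g @ g) / gg_prev
    if rule == "PRP":   # Polak-Ribiere-Polyak
        return float(g @ y) / gg_prev
    if rule == "HS":    # Hestenes-Stiefel
        return float(g @ y) / dy
    if rule == "DY":    # Dai-Yuan
        return float(g @ g) / dy
    if rule == "HZ":    # Hager-Zhang
        return float((y - 2.0 * d_prev * float(y @ y) / dy) @ g) / dy
    raise ValueError(f"unknown rule: {rule}")

if __name__ == "__main__":
    g_prev = np.array([1.0, 0.0])
    g = np.array([1.0, 2.0])
    d_prev = np.array([-1.0, 1.0])
    for rule in ("HS", "FR", "PRP", "DY", "HZ"):
        print(rule, cg_coefficient(rule, g, g_prev, d_prev))
```

The five rules share the same inputs and differ only in numerator and denominator, which is why a single dispatcher is enough for comparing them in one training loop.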

Theorems & Definitions (31)

  • Theorem 3.1
  • proof
  • Theorem 3.2
  • proof
  • Corollary 3.1
  • proof
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • ...and 21 more