Conjugate-Gradient-like Based Adaptive Moment Estimation Optimization Algorithm for Deep Learning

Jiawu Tian; Liwei Xu; Xiaowei Zhang; Yongqi Li

TL;DR

The paper tackles convergence challenges of Adam-type optimizers for non-convex deep-learning objectives. It proposes CG-like-Adam, whose update direction is conjugate-gradient-like: $d_t := g_t - \frac{\gamma_t}{t^a} d_{t-1}$ with $a\ge \tfrac{1}{2}$. Both the first- and second-order moment estimates are computed from this direction, with a bias-corrected $\hat m_t$ and a bounded $\hat v_t$. The authors prove convergence for the cases where $\beta_1$ is constant and where the first-order moment estimation is unbiased, with a rate characterized by $\min_t \mathbb{E}\|\nabla f(x_t)\|^2 = O(S_1(T)/S_2(T))$ and a corollary giving an $O(\log T/\sqrt{T})$ rate for $\alpha_t=\alpha/t^b$, $b\in[1/2,1)$. Experiments on CIFAR-10/100 with VGG-19 and ResNet-34 show faster convergence, lower training loss, and improved generalization relative to Adam and CoBA.

Abstract

Training deep neural networks is a challenging task. To speed up training and enhance the performance of deep neural networks, we rectify the vanilla conjugate gradient direction into a conjugate-gradient-like direction and incorporate it into the generic Adam, and thus propose a new optimization algorithm named CG-like-Adam for deep learning. Specifically, both the first-order and the second-order moment estimates of generic Adam are computed from the conjugate-gradient-like direction instead of the gradient. The convergence analysis handles the cases where the exponential moving average coefficient of the first-order moment estimation is constant and where the first-order moment estimation is unbiased. Numerical experiments on the CIFAR-10/100 datasets show the superiority of the proposed algorithm.
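The update described above can be illustrated with a minimal NumPy sketch. It is not the authors' reference implementation, and several details are assumptions made only for illustration: the function name `cg_like_adam`, the Fletcher-Reeves choice for $\gamma_t$, the AMSGrad-style running maximum used to keep $\hat v_t$ bounded, the bias correction applied to $v_t$, the `eps` safeguards, and the step size $\alpha_t = \alpha/\sqrt{t}$ (the $b = 1/2$ case of $\alpha_t=\alpha/t^b$).

```python
import numpy as np

def cg_like_adam(grad, x0, steps=200, alpha=0.1, beta1=0.9, beta2=0.999,
                 a=0.5, eps=1e-8):
    """Illustrative CG-like-Adam loop (hypothetical helper, not the paper's code).

    grad: callable returning the (stochastic) gradient at x.
    a:    exponent in the damping factor gamma_t / t**a (a >= 1/2).
    """
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)        # first-moment estimate
    v = np.zeros_like(x)        # second-moment estimate
    v_hat = np.zeros_like(x)    # bounded second moment (assumed: running max)
    d_prev = np.zeros_like(x)   # previous CG-like direction
    g_prev = None

    for t in range(1, steps + 1):
        g = grad(x)
        if g_prev is None:
            gamma = 0.0         # no previous direction at the first step
        else:
            # Assumed coefficient: Fletcher-Reeves ratio of squared gradient norms.
            gamma = float(np.vdot(g, g)) / (float(np.vdot(g_prev, g_prev)) + eps)

        # Conjugate-gradient-like direction: d_t = g_t - (gamma_t / t^a) d_{t-1}.
        d = g - (gamma / t**a) * d_prev

        # Both moment estimates are built from d_t instead of g_t.
        m = beta1 * m + (1.0 - beta1) * d
        v = beta2 * v + (1.0 - beta2) * d * d

        m_hat = m / (1.0 - beta1**t)                      # bias-corrected first moment
        v_hat = np.maximum(v_hat, v / (1.0 - beta2**t))   # assumed bounding rule

        # Decaying step size alpha_t = alpha / sqrt(t).
        x = x - (alpha / np.sqrt(t)) * m_hat / (np.sqrt(v_hat) + eps)

        d_prev, g_prev = d, g
    return x

# Usage on a toy quadratic f(x) = 0.5 * ||x||^2, whose gradient is x itself.
x_final = cg_like_adam(lambda z: z, np.array([3.0, -2.0]))
```

Because the factor $1/t^a$ shrinks the contribution of $d_{t-1}$ over time, the direction in this sketch approaches the plain gradient as $t$ grows, so late iterations behave much like an AMSGrad-style Adam step.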

Paper Structure (16 sections, 14 theorems, 105 equations, 13 figures, 2 algorithms)

Introduction
Preliminaries
Notation
Stochastic Optimization, Generic Adam and Stationary Point
Vanilla Conjugate Gradient
CG-like-Adam
Proposed Algorithm
Assumptions and Convergence Analysis
Experiments
Compare CG-like-Adam with CoBA
Compare CG-like-Adam with Adam
Conclusion
Proof of Some Lemmas
Proof of Theorem 3.1
Proof of Theorem 3.2
...and 1 more sections

Key Result

Theorem 3.1

Suppose that the assumptions Aass31-Aass34 are satisfied, and that $\beta_{1t} \in [0,1)$, $\beta_{1t} \leq \beta_{1(t+1)}$, and $\beta_{1(t+1)} \leq \beta_{1t} h(t)$ (or $\beta_{1t} h(t) \leq \beta_{1(t+1)}$) hold for all $t \in \mathcal{T}$, in which $h(t)$ is a fraction with numerator $(1-\beta_{11}^{t-1})(1-\beta_{11}^{t+1})$ [...] where $C_{1}$, $C_{2}$, $C_{3}$ and $C_{4}$ are constants independent of $T$, and $\mu_{t}$ [...]

Figures (13)

Figure 1: CG-like-Adam vs. CoBA under different learning rates (VGG-19, CIFAR-10, HS, Eq. (4)).
Figure 2: CG-like-Adam vs. CoBA under different learning rates (VGG-19, CIFAR-10, FR, Eq. (5)).
Figure 3: CG-like-Adam vs. CoBA under different learning rates (VGG-19, CIFAR-10, PRP, Eq. (6)).
Figure 4: CG-like-Adam vs. CoBA under different learning rates (VGG-19, CIFAR-10, DY, Eq. (7)).
Figure 5: CG-like-Adam vs. CoBA under different learning rates (VGG-19, CIFAR-10, HZ, Eq. (8)).
...and 8 more figures
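
The abbreviations HS, FR, PRP, DY and HZ in the captions conventionally denote the Hestenes-Stiefel, Fletcher-Reeves, Polak-Ribière-Polyak, Dai-Yuan and Hager-Zhang conjugate-gradient coefficients. The sketch below computes the classical textbook forms of these coefficients. The helper name `cg_coefficients` and the `eps` guards are assumptions, and the paper's own equations may use a different sign convention for the direction $d_{t-1}$, so this should be read as background rather than as the paper's exact definitions.

```python
import numpy as np

def cg_coefficients(g, g_prev, d_prev, eps=1e-12):
    """Classical CG coefficients for 1-D arrays (hypothetical helper).

    g:      current gradient g_t
    g_prev: previous gradient g_{t-1}
    d_prev: previous search direction d_{t-1}
    """
    y = g - g_prev                          # gradient difference y_{t-1}
    dy = float(d_prev @ y)                  # d_{t-1}^T y_{t-1}
    dy = dy if abs(dy) > eps else eps       # guard against division by zero
    gg_prev = max(float(g_prev @ g_prev), eps)

    return {
        "HS": float(g @ y) / dy,            # Hestenes-Stiefel
        "FR": float(g @ g) / gg_prev,       # Fletcher-Reeves
        "PRP": float(g @ y) / gg_prev,      # Polak-Ribiere-Polyak
        "DY": float(g @ g) / dy,            # Dai-Yuan
        # Hager-Zhang: (y - 2 d ||y||^2 / (d^T y))^T g / (d^T y)
        "HZ": float((y - 2.0 * d_prev * float(y @ y) / dy) @ g) / dy,
    }

# Usage with a classical steepest-descent start, d_prev = -g_prev.
coeffs = cg_coefficients(np.array([1.0, 2.0]),
                         np.array([2.0, 1.0]),
                         np.array([-2.0, -1.0]))
```

FR and PRP depend only on gradients, while HS, DY and HZ also depend on the previous direction, which is why the choice of coefficient can change how a CG-based optimizer reacts to different learning rates.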

Theorems & Definitions (31)

Theorem 3.1
proof
Theorem 3.2
proof
Corollary 3.1
proof
Lemma 1
proof
Lemma 2
proof
...and 21 more
