Adaptive Moment Estimation Optimization Algorithm Using Projection Gradient for Deep Learning

Yongqi Li, Xiaowei Zhang

TL;DR

The paper addresses the instability and generalization gaps often seen with adaptive optimizers like Adam by introducing PadamP, a projection-gradient optimizer that uses the $p$-th power of second-order moments under scale invariance. PadamP blends ideas from Madgrad, Padam, and AdamP to realize a partially adaptive momentum with a projection that controls weight-norm growth, detecting scale-invariant directions via cosine similarity and a fixed $\delta$ threshold. The authors prove a non-convex convergence result that decouples the first- and second-order moment coefficients, establishing that $\lim_{T\to\infty} \min_{t \le T} \mathbb{E}[\|g_t\|^2] = 0$ under reasonable conditions. Empirical results on CIFAR-10/100 using VGG-16 and ResNet-18 show faster convergence and better generalization than baselines, with dataset- and architecture-specific optimal $p$ values and potential gains from adaptive $p$ scheduling. Overall, PadamP provides a practically effective optimizer that maintains strong convergence properties while improving generalization in deep learning tasks.
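
Read literally, the convergence criterion above concerns the best iterate seen so far rather than the last one. Because $\min_{t \le T} \mathbb{E}[\|g_t\|^2]$ is non-increasing in $T$, the limit can be unpacked as follows. This restates the definition of the limit and is not an additional result from the paper. Here $g_t$ is assumed to be the gradient at step $t$, which the summary does not define explicitly.

$$
\forall \varepsilon > 0 \;\; \exists\, T_\varepsilon : \quad \min_{t \le T_\varepsilon} \mathbb{E}\big[\|g_t\|^2\big] \le \varepsilon .
$$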

Abstract

Training deep neural networks is challenging. To accelerate training and enhance performance, we propose PadamP, a novel optimization algorithm. PadamP is derived by applying the adaptive estimation of the p-th power of the second-order moments under scale invariance, enhancing projection adaptability by modifying the projection discrimination condition. It is integrated into Adam-type algorithms, accelerating training, boosting performance, and improving generalization in deep learning. Combining projected gradient benefits with adaptive moment estimation, PadamP tackles unconstrained non-convex problems. Convergence for the non-convex case is analyzed, focusing on the decoupling of first-order moment estimation coefficients and second-order moment estimation coefficients. Unlike prior work relying on , our proof generalizes the convergence theorem, enhancing practicality. Experiments using VGG-16 and ResNet-18 on CIFAR-10 and CIFAR-100 show PadamP's effectiveness, with notable performance on CIFAR-10/100, especially for VGG-16. The results demonstrate that PadamP outperforms existing algorithms in terms of convergence speed and generalization ability, making it a valuable addition to the field of deep learning optimization.
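
The ingredients named above (a partially adaptive denominator using the $p$-th power of the second-order moment, plus a projection triggered by a cosine-similarity test against a threshold $\delta$) can be illustrated with a minimal pure-Python sketch. It is not the authors' algorithm: the exact PadamP update and its modified discrimination condition are not given in this summary. The sketch assumes a Padam-style step `m / (v**p + eps)` and the AdamP-style test `cos(theta, grad) < delta / sqrt(dim)`, and it omits bias correction and weight decay. All function names and default values are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b, eps=1e-8):
    # Absolute cosine between two vectors; eps guards against zero norms.
    return abs(dot(a, b)) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + eps)

def project_out_radial(theta, update, eps=1e-8):
    # Remove the component of `update` that is parallel to `theta`,
    # so the step does not directly grow the weight norm.
    scale = dot(theta, update) / (dot(theta, theta) + eps)
    return [u - scale * w for u, w in zip(update, theta)]

def padamp_like_step(theta, grad, m, v, lr=0.1, beta1=0.9, beta2=0.999,
                     p=0.25, delta=0.1, eps=1e-8):
    # First- and second-order moment estimates (no bias correction here).
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, grad)]
    # Partially adaptive step: p-th power of the second moment (assumed form).
    update = [mi / (vi ** p + eps) for mi, vi in zip(m, v)]
    # Assumed AdamP-style test: a gradient nearly orthogonal to the weights
    # suggests a scale-invariant direction, so the update is projected.
    if cosine_similarity(theta, grad) < delta / math.sqrt(len(theta)):
        update = project_out_radial(theta, update)
    theta = [w - lr * u for w, u in zip(theta, update)]
    return theta, m, v

# Usage: one step on a 2-D toy weight vector.
theta, m, v = padamp_like_step([1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0])
```

In this sketch, setting `p=0.5` gives an Adam-like denominator, while smaller `p` moves the step toward plain momentum SGD. This is the sense in which the summary describes optimal $p$ values as depending on dataset and architecture.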

Paper Structure

This paper contains 13 sections, 6 theorems, 50 equations, 15 figures, 1 table, 2 algorithms.

Key Result

Lemma 1

Let $\|\theta_{t}^{\mathrm{GD}}\|_{2}$ and $\|\theta_{t}^{\mathrm{GDM}}\|_{2}$ be the weight norms at step $t \geq 0$, following the recursive formula in Equation eq14 and Equation eq15, respectively. We assume that the norms of the updates $\left\|p_{t}\right\|_{2}$ for GD with and without momentu

Figures (15)

  • Figure 1: Normalization layer and scale invariance
  • Figure 2: Vector directions of the gradient, momentum, and ours.
  • Figure 3: PadamP under different learning rates (VGG-16 trained on CIFAR-10).
  • Figure 4: VGG-16 trained on CIFAR-10 with different $p$ values (no learning-rate decay).
  • Figure 5: VGG-16 trained on CIFAR-100 with different $p$ values (no learning-rate decay).
  • ...and 10 more figures

Theorems & Definitions (10)

  • Lemma 1
  • Theorem 7
  • Lemma 8
  • proof 1
  • Lemma 9
  • proof 2
  • Lemma 10
  • proof 3
  • Lemma 11
  • proof 4