Adaptive Moment Estimation Optimization Algorithm Using Projection Gradient for Deep Learning

Yongqi Li; Xiaowei Zhang

TL;DR

The paper addresses the instability and generalization gaps often seen with adaptive optimizers like Adam by introducing PadamP, a projection-gradient optimizer that uses the $p$-th power of second-order moments under scale invariance. PadamP blends ideas from Madgrad, Padam, and AdamP to realize a partially adaptive momentum with a projection that controls weight-norm growth, detecting scale-invariant directions via cosine similarity and a fixed $\delta$ threshold. The authors prove a non-convex convergence result that decouples the first- and second-order moment coefficients, establishing that $\lim_{T\to\infty} \min_{t \le T} \mathbb{E}[\|g_t\|^2] = 0$ under reasonable conditions. Empirical results on CIFAR-10/100 using VGG-16 and ResNet-18 show faster convergence and better generalization than baselines, with dataset- and architecture-specific optimal $p$ values and potential gains from adaptive $p$ scheduling. Overall, PadamP provides a practically effective optimizer that maintains strong convergence properties while improving generalization in deep learning tasks.

Abstract

Training deep neural networks is challenging. To accelerate training and enhance performance, we propose PadamP, a novel optimization algorithm. PadamP applies adaptive estimation of the $p$-th power of the second-order moments under scale invariance and improves projection adaptability by modifying the projection discrimination condition. Integrated into Adam-type algorithms, it accelerates training, boosts performance, and improves generalization in deep learning. By combining the benefits of projected gradients with adaptive moment estimation, PadamP tackles unconstrained non-convex problems. We analyze convergence in the non-convex case, focusing on decoupling the first-order and second-order moment estimation coefficients. Unlike prior work relying on , our proof generalizes the convergence theorem, enhancing practicality. Experiments with VGG-16 and ResNet-18 on CIFAR-10 and CIFAR-100 show PadamP's effectiveness, most notably for VGG-16. The results demonstrate that PadamP outperforms existing algorithms in convergence speed and generalization ability, making it a valuable addition to deep learning optimization.
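To make the two ingredients named above concrete, the following is a minimal NumPy sketch of one update step that combines a $p$-th power second-moment scaling with a projection triggered by a cosine-similarity test. It is an illustration under stated assumptions, not the authors' algorithm: the test `|cos(w, grad)| < delta / sqrt(dim)` is borrowed from AdamP rather than taken from the paper's modified discrimination condition, bias correction and weight decay are omitted, and the function names and default hyperparameters (`lr`, `beta1`, `beta2`, `p`, `delta`) are hypothetical.

```python
import numpy as np


def cosine_similarity(a, b, eps=1e-8):
    """Absolute cosine similarity between two arrays, treated as flat vectors."""
    a, b = a.ravel(), b.ravel()
    return abs(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)


def project_out_radial(w, update, eps=1e-8):
    """Remove the component of `update` that is parallel to `w`."""
    w_flat = w.ravel()
    coeff = (w_flat @ update.ravel()) / (w_flat @ w_flat + eps)
    return update - coeff * w


def padamp_like_step(w, grad, m, v, lr=0.1, beta1=0.9, beta2=0.999,
                     p=0.25, delta=0.1, eps=1e-8):
    """One illustrative step; all defaults are assumptions, not paper values."""
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2

    # Partially adaptive scaling: divide by v**p instead of sqrt(v).
    update = m / (v ** p + eps)

    # Assumed AdamP-style test: a gradient nearly orthogonal to the weights
    # suggests a scale-invariant direction, so drop the radial component
    # of the update to limit weight-norm growth.
    if cosine_similarity(w, grad) < delta / np.sqrt(w.size):
        update = project_out_radial(w, update)

    return w - lr * update, m, v


if __name__ == "__main__":
    w = np.array([1.0, 0.0])
    grad = np.array([0.0, 1.0])  # orthogonal to w, so the projection branch runs
    w_new, m, v = padamp_like_step(w, grad, np.zeros(2), np.zeros(2))
    print(w_new)
```

In this sketch, `p = 0.5` gives Adam-style scaling (without bias correction), while smaller `p` moves the step toward SGD with momentum; this is the sense in which the momentum is only partially adaptive.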
