Combining Adam and its Inverse Counterpart to Enhance Generalization of Deep Learning Optimizers

Tao Shi; Liangming Chen; Long Jin; Mengchu Zhou

TL;DR

This paper proposes dual Adam (DualAdam), which integrates the update mechanisms of Adam and InvAdam to ensure convergence while enhancing generalization performance, and uses diffusion theory to demonstrate mathematically that InvAdam can escape sharp minima.

Abstract

In the training of neural networks, adaptive moment estimation (Adam) typically converges fast but exhibits suboptimal generalization performance. A widely accepted explanation for this generalization defect is that Adam tends to converge to sharp minima. To enhance its ability to find flat minima, we propose a new variant named inverse Adam (InvAdam). The key change in InvAdam lies in its parameter update mechanism, which is the opposite of Adam's. Specifically, InvAdam computes the element-wise multiplication of the first-order and second-order moments, whereas Adam computes the element-wise division of these two moments. This modification increases the step size of the parameter update when the elements of the second-order moment are large and decreases it when they are small, which helps the parameters escape sharp minima and stay at flat ones. However, InvAdam's update mechanism may face challenges in convergence. To address this, we propose dual Adam (DualAdam), which integrates the update mechanisms of both Adam and InvAdam, ensuring convergence while enhancing generalization performance. Additionally, we use diffusion theory to mathematically demonstrate InvAdam's ability to escape sharp minima. Extensive experiments are conducted on image classification tasks and large language model (LLM) fine-tuning. The results validate that DualAdam outperforms Adam and its state-of-the-art variants in terms of generalization performance. The code is publicly available at https://github.com/LongJin-lab/DualAdam.
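The contrast between the two update mechanisms described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the hyperparameter defaults, and in particular the use of the square root of the second-order moment in the InvAdam direction (chosen here only to mirror the standard Adam rule) are assumptions. The rule by which DualAdam combines the two directions is not stated in this summary, so it is not sketched.

```python
import numpy as np


def bias_corrected_moments(m, v, grad, step, beta1=0.9, beta2=0.999):
    """Standard Adam moment updates followed by bias correction."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    m_hat = m / (1.0 - beta1 ** step)
    v_hat = v / (1.0 - beta2 ** step)
    return m, v, m_hat, v_hat


def adam_direction(m_hat, v_hat, eps=1e-8):
    """Adam: element-wise division, so large second moments shrink the step."""
    return m_hat / (np.sqrt(v_hat) + eps)


def inv_adam_direction(m_hat, v_hat):
    """InvAdam-style direction: element-wise multiplication instead of division.

    ASSUMPTION: the square root is applied to v_hat to mirror Adam; the exact
    form used in the paper is not given in this summary.
    """
    return m_hat * np.sqrt(v_hat)


# One coordinate with a small gradient and one with a large gradient.
grad = np.array([0.01, 10.0])
m = np.zeros_like(grad)
v = np.zeros_like(grad)
m, v, m_hat, v_hat = bias_corrected_moments(m, v, grad, step=1)

adam_dir = adam_direction(m_hat, v_hat)
inv_dir = inv_adam_direction(m_hat, v_hat)
print("Adam direction:   ", adam_dir)
print("InvAdam direction:", inv_dir)
```

In this toy case Adam normalizes both coordinates to roughly the same magnitude, while the multiplicative rule takes a much larger step along the coordinate with the large second-order moment. That is the behavior the abstract credits with helping the parameters leave sharp minima, and it also suggests why such a rule may struggle to converge on its own.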

Paper Structure (21 sections, 1 theorem, 34 equations, 6 figures, 5 tables, 1 algorithm)

Introduction
Related work
Adam Variants
Switching between Two Update Mechanisms
Proposed Methods
DualAdam
Computational Complexity Analysis of DualAdam
Theoretical Analysis of Ability to Escape Sharp Minima
Convergence Analysis
Simulations and Experiments
Numerical Simulations on 2-Parameter Loss Landscapes
Image Classification on CIFAR-10 and CIFAR-100
Image Classification on Tiny ImageNet
Image Classification on ImageNet-1k
Fine-Tuning on Large Language Model
...and 6 more sections

Key Result

Theorem 1

If Assumptions 1--3 hold and the dynamics are governed by InvAdam, then the mean escape time from the minimum $\boldsymbol{\phi}$ to the outside of $\boldsymbol{\phi}$ is given by an expression in which $H_{\boldsymbol{\chi e}}$ is the eigenvalue of the Hessian matrix of the loss function at the saddle point $\boldsymbol{\chi}$ along the escape direction $\boldsymbol{e}$ and $s \in (0,1)$ is a path-dependent coefficient.

Figures (6)

Figure 1: Main idea of DualAdam. (a) The relationship between the sharp minimum $\phi$ and the flat minimum $\psi$; $\chi$ is the saddle point between $\phi$ and $\psi$. (b) The update mechanisms of Adam and InvAdam. $\boldsymbol{\hat{m}}$ and $\boldsymbol{\hat{v}}$ are the bias-corrected first- and second-order moments, respectively. Subscript $i$ denotes the $i$-th element of a vector, and $\Delta \boldsymbol{\theta}_i$ is the $i$-th element of the parameter update in one iteration. (c) The update mechanism of DualAdam.
Figure 2: Optimization trajectories of InvAdam and Adam on 2-parameter loss landscapes. The red stars represent the start points, and the black circles represent the end points.
Figure 3: Test accuracies over epochs on CIFAR-100.
Figure 4: Comparisons of training loss, validation perplexity, and generalization gap of DualAdam and AdamW on the fine-tuning of OpenPangu-1B.
Figure 5: Comparisons of the top Hessian eigenvalues, Hessian traces, and Hessian eigenvalue densities of the loss landscapes on the CIFAR-100 dataset using ResNet18.
...and 1 more figures

Theorems & Definitions (1)

Theorem 1
