AGD: an Auto-switchable Optimizer using Stepwise Gradient Difference for Preconditioning Matrix

Yun Yue; Zhiling Ye; Jiadi Jiang; Yongchao Liu; Ke Zhang

TL;DR

AGD introduces a diagonal preconditioner built from stepwise gradient differences, which encodes Hessian information, together with an auto-switch mechanism controlled by a threshold $\delta$ that selects between SGD-like momentum updates and adaptive updates. Theoretical guarantees are established for non-convex and convex stochastic optimization, including convergence and regret bounds. Empirically, AGD matches or surpasses state-of-the-art optimizers across NLP, CV, and RecSys tasks, with favorable compute and memory profiles. The threshold $\delta$ gives a tunable way to balance SGD-like and curvature-aware updates for better generalization, and code is available for replication.

Abstract

Adaptive optimizers, such as Adam, have achieved remarkable success in deep learning. A key component of these optimizers is the so-called preconditioning matrix, providing enhanced gradient information and regulating the step size of each gradient direction. In this paper, we propose a novel approach to designing the preconditioning matrix by utilizing the gradient difference between two successive steps as the diagonal elements. These diagonal elements are closely related to the Hessian and can be perceived as an approximation of the inner product between the Hessian row vectors and difference of the adjacent parameter vectors. Additionally, we introduce an auto-switching function that enables the preconditioning matrix to switch dynamically between Stochastic Gradient Descent (SGD) and the adaptive optimizer. Based on these two techniques, we develop a new optimizer named AGD that enhances the generalization performance. We evaluate AGD on public datasets of Natural Language Processing (NLP), Computer Vision (CV), and Recommendation Systems (RecSys). Our experimental results demonstrate that AGD outperforms the state-of-the-art (SOTA) optimizers, achieving highly competitive or significantly better predictive performance. Furthermore, we analyze how AGD is able to switch automatically between SGD and the adaptive optimizer and its actual effects on various scenarios. The code is available at https://github.com/intelligent-machine-learning/atorch/tree/main/atorch/optimizers.
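The two ideas in the abstract can be sketched in a few lines. The link to the Hessian comes from a first-order Taylor expansion of the gradient, $g_t - g_{t-1} \approx H_t(\theta_t - \theta_{t-1})$, so the difference between successive gradients carries curvature information along the last step. The sketch below is a simplified illustration and not the code in the linked repository: the names `init_state` and `agd_like_step` are hypothetical, and the Adam-style moving averages, the bias correction, and the use of the momentum difference as the "gradient difference" are assumptions made for this example. The switch is modeled as `max(sqrt(b), delta)`: when the curvature estimate falls below $\delta$, the denominator is constant and the step reduces to SGD with momentum.

```python
import math


def init_state(dim):
    # Zero-initialised optimizer state for a parameter vector of length `dim`.
    return {"t": 0, "m": [0.0] * dim, "m_hat": [0.0] * dim, "b": [0.0] * dim}


def agd_like_step(w, g, state, lr=0.1, beta1=0.9, beta2=0.999, delta=1e-5):
    """One step of a simplified AGD-style update (illustrative assumption,
    not the reference implementation). Returns (new_w, adaptive_mask)."""
    t = state["t"] + 1
    # Adam-style first moment with bias correction (assumption).
    m = [beta1 * mp + (1.0 - beta1) * gi for mp, gi in zip(state["m"], g)]
    m_hat = [mi / (1.0 - beta1 ** t) for mi in m]
    # Stepwise difference between successive bias-corrected momenta;
    # on the first step the previous value is zero, so s equals m_hat.
    s = [cur - prev for cur, prev in zip(m_hat, state["m_hat"])]
    # Moving average of the squared difference: the diagonal preconditioner.
    b = [beta2 * bi + (1.0 - beta2) * si * si for bi, si in zip(state["b"], s)]
    b_hat = [bi / (1.0 - beta2 ** t) for bi in b]
    new_w, adaptive = [], []
    for wi, mh, bh in zip(w, m_hat, b_hat):
        scale = math.sqrt(bh)
        # Auto-switch: adaptive when scale > delta, SGD-like otherwise.
        adaptive.append(scale > delta)
        new_w.append(wi - lr * mh / max(scale, delta))
    state.update(t=t, m=m, m_hat=m_hat, b=b)
    return new_w, adaptive
```

In this sketch a large `delta` keeps every coordinate in the SGD-like branch (with effective learning rate `lr / delta`), while a very small `delta` makes every coordinate adaptive. Intermediate values let each coordinate switch independently as its curvature estimate changes.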

Paper Structure (33 sections, 9 theorems, 42 equations, 8 figures, 8 tables, 2 algorithms)

Introduction
Related work
Algorithm
Details of AGD optimizer
Gradient difference
Auto switch
Comparison with other optimizers
Comparison with AdaBound
Comparison with AdaBelief
Numerical analysis
Experiments
Experiment setup
NLP
CV
RecSys
...and 18 more sections

Key Result

Theorem 1

(Convergence in non-convex settings) Under the stated assumptions, the AGD algorithm yields a convergence bound expressed in terms of constants $C_3$, $C_4$ and $C_5$.

Figures (8)

Figure 1: AGD
Figure 2: Comparison of stability between AGD and AdaBelief relative to the parameter $\delta~(\hbox{or}~\epsilon)$ for ResNet32 on Cifar10. AGD shows better stability over a wide range of $\delta~(\hbox{or}~\epsilon)$ variations than AdaBelief.
Figure 3: Trajectories of different optimizers in three test functions, where $f(x,y)=(x+y)^2+(x-y)^2/10$. We also provide animated versions at https://youtu.be/Qv5X3v5YUw0.
Figure 4: Test PPL ([$\mu \pm \sigma$]) on Penn Treebank for 1,2,3-layer LSTM.
Figure 5: Test accuracy ([$\mu \pm \sigma$]) of different optimizers for ResNet20/32 on Cifar10 and ResNet18 on ImageNet.
...and 3 more figures

Theorems & Definitions (17)

Theorem 1
Corollary 1
Corollary 2
Theorem 2
Corollary 3
proof
Lemma 1
proof
Lemma 2
proof
...and 7 more
