On Suppressing Range of Adaptive Stepsizes of Adam to Improve Generalisation Performance

Guoqiang Zhang

TL;DR

SET-Adam addresses the generalization gap of adaptive optimizers by compressing the range of per-parameter stepsizes using layerwise gradient statistics. It applies three consecutive operations to the second moment: layerwise down-scaling, epsilon-embedding (placing epsilon inside the square root), and down-translating to lift extremely small stepsizes. The resulting updates are closer to SGD with momentum. The authors provide a convergence analysis in the convex setting and experiments across Transformers, LSTMs, VGG/ResNet, WGAN-GP, and ImageNet. SET-Adam outperforms eight adaptive optimizers on the NLP and CIFAR tasks, matches the best of them on WGAN-GP, and gives higher validation accuracy than Adam and AdaBelief for ResNet18 on ImageNet, with modest overhead. The work suggests that careful design of the adaptive stepsize range can yield generalization benefits across diverse DNNs.

Abstract

A number of recent adaptive optimizers improve the generalisation performance of Adam by essentially reducing the variance of adaptive stepsizes to get closer to SGD with momentum. Following this motivation, we suppress the range of the adaptive stepsizes of Adam by exploiting the layerwise gradient statistics. In particular, at each iteration, we propose to perform three consecutive operations on the second moment v_t before using it to update a DNN model: (1) down-scaling, (2) epsilon-embedding, and (3) down-translating. The resulting algorithm is referred to as SET-Adam, where SET abbreviates the three operations. The down-scaling operation on v_t is performed layerwise by making use of the angles between the layerwise subvectors of v_t and the corresponding all-one subvectors. Extensive experimental results show that SET-Adam outperforms eight adaptive optimizers when training transformers and LSTMs for NLP, and VGG and ResNet for image classification over CIFAR10 and CIFAR100, while matching the best performance of the eight adaptive methods when training WGAN-GP models for image generation tasks. Furthermore, SET-Adam produces higher validation accuracies than Adam and AdaBelief for training ResNet18 over ImageNet.
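The three operations can be sketched for a single layer as follows. This is a minimal illustration, not the paper's Algorithm 1: the abstract does not give the formulas, so the cos-squared scaling factor in `down_scale`, the shift rule in `down_translate`, and the parameter names `eps` and `tau` are all assumptions made here for illustration. Bias correction and the common stepsize are omitted.

```python
import math


def down_scale(v_layer):
    """(1) Layerwise down-scaling (assumed form): multiply v_layer by cos^2 of
    the angle between v_layer and the all-one vector of the same length."""
    norm_v = math.sqrt(sum(x * x for x in v_layer))
    if norm_v == 0.0:
        return list(v_layer)  # all-zero layer: nothing to scale
    cos = sum(v_layer) / (norm_v * math.sqrt(len(v_layer)))
    return [cos * cos * x for x in v_layer]


def eps_embed(w_layer, eps):
    """(2) Epsilon-embedding: eps is added inside the square root."""
    return [math.sqrt(x + eps) for x in w_layer]


def down_translate(denoms, tau):
    """(3) Down-translating (assumed form): subtract tau * min(denoms), with
    0 <= tau < 1, so every denominator stays positive and every stepsize grows."""
    shift = tau * min(denoms)
    return [d - shift for d in denoms]


def set_stepsizes(v_layer, eps=1e-6, tau=0.5):
    """Per-coordinate adaptive stepsizes for one layer (hypothetical helper)."""
    denoms = down_translate(eps_embed(down_scale(v_layer), eps), tau)
    return [1.0 / d for d in denoms]


# Usage: one layer whose second-moment entries span four orders of magnitude.
v = [1e-6, 1e-4, 1e-2]
print(set_stepsizes(v))
```

Under these assumptions, a layer whose entries are all equal has cosine 1 and is left unchanged, while a layer with widely spread entries is shrunk more strongly, so `eps` has more influence inside the square root and the ratio between the largest and smallest denominator decreases.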

Paper Structure (19 sections, 2 theorems, 17 equations, 7 figures, 10 tables, 2 algorithms)

Introduction
Algorithmic Design of SET-Adam
Motivation of layerwise down-scaling operation
Design of layerwise down-scaling operation
$\epsilon$-embedding for suppressing range of adaptive stepsizes
Down-translating for avoiding extreme small adaptive stepsizes
Convergence analysis
Experiments
On training a transformer
On training LSTMs
On training VGG11 and ResNet34 over CIFAR10 and CIFAR100
On training WGAN-GP over CIFAR10
On training ResNet18 over ImageNet
Conclusions
Verification that AdaBelief also utilizes the $\epsilon$-embedding operation
...and 4 more sections

Key Result

Theorem 1

Suppose $\{\boldsymbol{\theta}_t\}_{t=0}^T$ and $\{\tilde{\boldsymbol{w}}_t\}_{t=0}^T$ are the iterative updates obtained by running SET-Adam starting ...

Footnote: $\beta_1$ in Algorithm 1 is generalized to be $\beta_{1t}$, $t\geq 0$, to facilitate convergence analysis. AdaBelief was analyzed in a similar manner.

Figures (7)

Figure 1: Comparison of layerwise averages of adaptive stepsizes for the 11 neural layers of VGG11 when training over CIFAR10 for 200 epochs. See Appendix \ref{appendix:fig_setup} for the parameter setups of the two methods, where the optimal parameter $\epsilon$ for Adam was selected from a discrete set to give the best validation performance. The jumps in the curves at 100 and 160 epochs are due to the change in the common stepsize. SET-Adam has a much more compact range of layerwise average stepsizes than Adam.
Figure 2: Comparison of layerwise standard deviations (stds) of adaptive stepsizes for the 11 neural layers when training VGG11 over CIFAR10 for 200 epochs. SET-Adam has much smaller layerwise stds than Adam for ten out of eleven neural layers.
Figure 3: Demonstration of the down-scaling operation in SET-Adam. The vector $\boldsymbol{1}_l$ is of the same dimension as $\boldsymbol{v}_{l,t}$.
Figure 4: Comparison of layerwise averages of adaptive stepsizes for the 11 neural layers when training VGG11 over CIFAR10 for 200 epochs. For the plot of SET-Adam, the down-translating operation is ignored and only the first two operations are included. See Appendix \ref{appendix:fig_setup} for the algorithmic parameter setups.
Figure 5: Performance visualisation of Adam and SET-Adam for the training of the transformer.
...and 2 more figures
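
The layerwise statistics plotted in Figures 1 and 2 can be approximated with a short sketch like the one below. It assumes the standard Adam per-coordinate stepsize `lr / (sqrt(v) + eps)` and ignores bias correction. The function name `layerwise_stats`, the dictionary layout, and the layer names are hypothetical and not taken from the paper's code.

```python
import math
import statistics


def adam_stepsizes(v_layer, lr=1e-3, eps=1e-8):
    """Standard Adam adaptive stepsizes for one layer (bias correction omitted)."""
    return [lr / (math.sqrt(x) + eps) for x in v_layer]


def layerwise_stats(v_by_layer, lr=1e-3, eps=1e-8):
    """Map each layer name to (mean, population std) of its adaptive stepsizes."""
    stats = {}
    for name, v_layer in v_by_layer.items():
        steps = adam_stepsizes(v_layer, lr=lr, eps=eps)
        stats[name] = (statistics.fmean(steps), statistics.pstdev(steps))
    return stats


# Usage: two toy layers with made-up second-moment values.
toy = {"conv1": [4.0, 4.0], "fc": [1.0, 4.0]}
for name, (mean, std) in layerwise_stats(toy, lr=1.0, eps=0.0).items():
    print(name, mean, std)
```

Recording these two numbers per layer at each epoch gives curves of the kind the captions describe. Because the common stepsize `lr` enters as a multiplicative factor, a change in it shows up as a jump in every layer's curve.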

Theorems & Definitions (5)

Theorem 1
proof
Remark 1
proof
Lemma 1
