Adaptive Gradient Regularization: A Faster and Generalizable Optimization Technique for Deep Neural Networks

Huixiu Jiang; Ling Yang; Yu Bao; Rutong Si; Sikun Yang

TL;DR

Adaptive Gradient Regularization (AGR) introduces per-element gradient coefficients computed from gradient magnitudes to regulate gradient updates, guiding a more favorable descent direction and smoothing the loss landscape. Implemented with only three lines of code added to optimizers like AdamW and Adan, AGR also enables per-parameter learning-rate adjustments guided by gradient magnitude. Theoretical results guarantee restricted Lipschitzness of the updated gradient and an adaptive learning-rate mechanism, while experiments across DDPM diffusion, image classification, and ALBERT language modeling demonstrate improved training efficiency and generalization. The approach is positioned as a lightweight, broadly applicable enhancement for deep neural networks, with future work aiming to scale AGR to transformers and large language models and to refine scheduling for late-training stages.

Abstract

Stochastic optimization plays a crucial role in the advancement of deep learning technologies. Over the decades, significant effort has been dedicated to improving the training efficiency and robustness of deep neural networks via various strategies, including gradient normalization (GN) and gradient centralization (GC). Nevertheless, to the best of our knowledge, no prior work has sought to capture the optimal gradient descent trajectory by adaptively controlling the gradient descent direction. To address this gap, this paper is the first attempt to study a new optimization technique for deep neural networks that uses the sum normalization of a gradient vector as coefficients to dynamically regularize gradients and thus to effectively control the optimization direction. The proposed technique is named adaptive gradient regularization (AGR). It can be viewed as an adaptive gradient clipping method. The theoretical analysis reveals that AGR can effectively smooth the loss landscape, and hence can significantly improve training efficiency and model generalization performance. We note that AGR can greatly improve the training efficiency of vanilla optimizers, including Adan and AdamW, by adding only three lines of code. Experiments on image generation, image classification, and language representation demonstrate that the AGR method not only improves training efficiency but also enhances model generalization performance.
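
To make the "sum normalization as coefficients" idea concrete, here is a minimal sketch in plain Python on flat lists of floats. It rests on two assumptions that this page does not spell out: that the regularized gradient takes the form (1 - α) ⊙ g, where α = |g| / Σ|g|, and that AGR is applied to the raw gradient before the AdamW moment estimates. The function names `agr` and `adamw_step` are hypothetical and are not taken from the authors' code.

```python
import math


def agr(grad, eps=1e-12):
    """Sketch of AGR on a flat list of floats.

    alpha_i = |g_i| / sum_j |g_j|   (sum normalization of |grad|)
    psi_i   = (1 - alpha_i) * g_i   (ASSUMED form of the regularized gradient)
    """
    total = sum(abs(g) for g in grad)
    if total < eps:
        # All-zero gradient: the ratio is undefined, so return it unchanged.
        return list(grad)
    return [(1.0 - abs(g) / total) * g for g in grad]


def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2, use_agr=True):
    """One AdamW update on flat lists; t is the 1-based step count.

    ASSUMPTION: AGR rewrites the gradient before the moment updates.
    """
    if use_agr:
        grad = agr(grad)
    new_w, new_m, new_v = [], [], []
    for wi, gi, mi, vi in zip(w, grad, m, v):
        mi = beta1 * mi + (1.0 - beta1) * gi          # first moment
        vi = beta2 * vi + (1.0 - beta2) * gi * gi     # second moment
        m_hat = mi / (1.0 - beta1 ** t)               # bias correction
        v_hat = vi / (1.0 - beta2 ** t)
        # Adam direction plus decoupled weight decay.
        wi = wi - lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * wi)
        new_w.append(wi)
        new_m.append(mi)
        new_v.append(vi)
    return new_w, new_m, new_v


# Larger entries are shrunk more strongly: alpha = [0.25, 0.75, 0.0].
print(agr([1.0, -3.0, 0.0]))  # [0.75, -0.75, 0.0]
```

Under this assumed form, entries with a larger share of the total gradient magnitude are damped more, which is consistent with the abstract's description of AGR as an adaptive gradient clipping method. A side effect worth noting is that a single-element gradient would be scaled to zero, so a real implementation would presumably apply the rule per weight tensor rather than per scalar parameter.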

Paper Structure (16 sections, 4 equations, 3 figures, 8 tables, 2 algorithms)


1. Introduction
2. Related Work
3. Adaptive Gradient Regularization
3.1 Notations
3.2 AGR Formulation
3.3 Applying AGR to AdamW/Adan Optimizers
4. AGR Properties
5. Experimental Results
5.1 Experimental Setup
5.2 Generative Model: DDPM
5.3 Supervised Classification on Tiny-ImageNet and CIFAR100
5.3.1 Results on CIFAR100
5.3.2 Results on Tiny-ImageNet
5.4 Language Representations: ALBERT
5.5 Ablation Studies
...and 1 more section

Figures (3)

Figure 1: (a) and (b) sketch how AGR is embedded into the vanilla optimizer. W is the weight tensor, $\mathcal{L}$ is the loss function, $\nabla_{w}\mathcal{L}$ is the gradient of the loss with respect to the weights, and $\Psi(\nabla_{w}\mathcal{L})$ is the gradient after applying AGR. (c) sketches the AGR calculation: $\left|\nabla_{w}\mathcal{L}\right|$ is the element-wise absolute value of the gradient, and $\sum\left|\nabla_{w}\mathcal{L}\right|$ is the sum of $\left|\nabla_{w}\mathcal{L}\right|$ over all dimensions. The black line represents the ratio of the two, which yields the corresponding coefficient matrix.
Figure 2: Training loss and test accuracy of ResNet18 architectures on Tiny-ImageNet. "AGR" denotes the AdamW optimizer with AGR embedded.
Figure 3: Training loss and test accuracy on Tiny-ImageNet. "AGR" denotes the AdamW optimizer with AGR embedded.
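
The calculation sketched in Figure 1(c) can be written out as follows. The index notation $i,j$ and the symbol $\alpha$ are introduced here for illustration, and the last line is an assumed form of $\Psi$ that the caption itself does not state.

$$
\alpha_{i,j} = \frac{\left|(\nabla_{w}\mathcal{L})_{i,j}\right|}{\sum_{k,l}\left|(\nabla_{w}\mathcal{L})_{k,l}\right|},
\qquad \alpha_{i,j} \ge 0,
\qquad \sum_{i,j}\alpha_{i,j} = 1
$$

$$
\Psi(\nabla_{w}\mathcal{L})_{i,j} = \left(1 - \alpha_{i,j}\right)(\nabla_{w}\mathcal{L})_{i,j} \quad \text{(assumed form)}
$$

Because the coefficients are non-negative and sum to one, each $1 - \alpha_{i,j}$ lies in $[0, 1]$. Under the assumed form, AGR therefore never flips the sign of a gradient entry and never increases its magnitude.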
