Interpreting Adaptive Gradient Methods by Parameter Scaling for Learning-Rate-Free Optimization

Min-Kook Suh; Seung-Woo Seo

TL;DR

This work recasts adaptive gradient methods as parameter-scaled SGD by showing that $\alpha$-based gradient scaling is equivalent to parameter scaling, which enables learning-rate-free variants of adaptive optimizers. It introduces two methods, PS-SPS and PS-DA-SGD, that adapt existing learning-rate-free strategies to adaptive gradients, and it proves a basic convergence result under AMSGrad-style scaling. Empirically, PS-DA-SGD performs robustly across vision, NLP, reinforcement learning, and self-supervised settings, often matching or approaching hand-tuned Adam, while PS-SPS is competitive in stable regimes. The approach extends learning-rate-free optimization to modern deep learning with adaptive gradient methods, though challenges such as overfitting on small datasets remain open.

Abstract

We address the challenge of estimating the learning rate for adaptive gradient methods used in training deep neural networks. Several learning-rate-free approaches have been proposed, but they are typically tailored to steepest descent. Although steepest descent offers an intuitive approach to finding minima, many deep learning applications require adaptive gradient methods to achieve faster convergence. In this paper, we interpret adaptive gradient methods as steepest descent applied to parameter-scaled networks and, on that basis, propose learning-rate-free adaptive gradient methods. Experimental results verify the effectiveness of this approach, demonstrating performance comparable to hand-tuned learning rates across various scenarios. This work extends the applicability of learning-rate-free methods to training with adaptive gradient methods.
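
The equivalence between gradient scaling and parameter scaling that the abstract relies on can be checked numerically. The following is a minimal sketch, not the paper's algorithm: it assumes an element-wise reparameterization theta = alpha * w, under which one plain SGD step on w corresponds to a step on theta whose gradient is scaled by alpha squared. The toy quadratic loss, the choice of alpha, and all function names are hypothetical and chosen only for illustration.

```python
# Illustrative sketch (assumptions: element-wise scaling theta = alpha * w,
# a toy quadratic loss, and hypothetical function names; not from the paper).

def grad_loss(theta):
    # Gradient of the toy loss f(theta) = 0.5 * sum(c_i * theta_i**2).
    curvature = [1.0, 10.0, 100.0]
    return [c * t for c, t in zip(curvature, theta)]

def sgd_step_scaled_params(theta, alpha, lr):
    # Reparameterize theta = alpha * w, take one plain SGD step on w, map back.
    w = [t / a for t, a in zip(theta, alpha)]
    g_theta = grad_loss(theta)
    g_w = [a * g for a, g in zip(alpha, g_theta)]  # chain rule: df/dw = alpha * df/dtheta
    w_new = [wi - lr * gi for wi, gi in zip(w, g_w)]
    return [a * wi for a, wi in zip(alpha, w_new)]

def gradient_scaled_step(theta, alpha, lr):
    # Step taken directly on theta with the gradient scaled by alpha**2.
    g = grad_loss(theta)
    return [t - lr * a * a * gi for t, a, gi in zip(theta, alpha, g)]

theta0 = [1.0, -2.0, 0.5]
# Assumed scaling: alpha**2 = 1/|g|, which normalizes each coordinate's step size.
alpha = [abs(g) ** -0.5 for g in grad_loss(theta0)]
lr = 0.1

step_a = sgd_step_scaled_params(theta0, alpha, lr)
step_b = gradient_scaled_step(theta0, alpha, lr)
max_gap = max(abs(a - b) for a, b in zip(step_a, step_b))
print(step_a, step_b, max_gap)
```

Under these assumptions the two updates agree up to floating-point error, so a learning-rate estimate derived for plain SGD on w carries over to an adaptive step on theta. How the scaling is actually chosen and updated during training is specified in the paper's algorithm sections, not here.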

Paper Structure (24 sections, 19 equations, 1 figure, 4 tables, 2 algorithms)

Introduction
Related work
Adaptive gradient methods
Learning-rate-free methods
Motivation
Proposed method
Algorithm
Convergence analysis
Experiments
Optimizers
Testing environments
Experimental results
Analysis
Supervised classification on CIFAR-100
Reinforcement learning
...and 9 more sections

Figures (1)

Figure 1: Estimated learning rate
