Interpreting Adaptive Gradient Methods by Parameter Scaling for Learning-Rate-Free Optimization
Min-Kook Suh, Seung-Woo Seo
TL;DR
This work recasts adaptive gradient methods as SGD on parameter-scaled networks by showing that scaling the gradient through a factor $\alpha$ is equivalent to scaling the parameters, which makes learning-rate-free variants of adaptive optimizers possible. It introduces two methods, PS-SPS and PS-DA-SGD, that adapt existing learning-rate-free strategies to adaptive gradients, and it proves basic convergence under AMSGrad-style scaling. Empirically, PS-DA-SGD performs robustly across vision, NLP, reinforcement learning, and self-supervised settings, often matching or approaching hand-tuned Adam, while PS-SPS is competitive in stable training regimes. The approach extends learning-rate-free optimization to modern deep learning with adaptive gradient methods, although open problems such as overfitting on small datasets remain.
Abstract
We address the challenge of estimating the learning rate for adaptive gradient methods used in training deep neural networks. While several learning-rate-free approaches have been proposed, they are typically tailored for steepest descent. Although steepest descent methods offer an intuitive approach to finding minima, many deep learning applications require adaptive gradient methods to achieve faster convergence. In this paper, we interpret adaptive gradient methods as steepest descent applied to parameter-scaled networks and, on that basis, propose learning-rate-free adaptive gradient methods. Experimental results verify the effectiveness of this approach, demonstrating performance comparable to hand-tuned learning rates across various scenarios. This work extends the applicability of learning-rate-free methods to training with adaptive gradient methods.
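The parameter-scaling interpretation can be checked numerically on a toy problem. The sketch below is an illustration under stated assumptions, not the authors' implementation: it assumes the reparameterization $w = \alpha \odot \theta$, under which the chain rule makes a plain gradient step on $\theta$ equal to a step on $w$ with each gradient coordinate scaled by $\alpha^2$. The quadratic loss, the fixed `alpha` values, and all function names are hypothetical; the excerpt does not state the paper's exact convention for $\alpha$.

```python
import math

def grad_f(w):
    """Gradient of the toy loss f(w) = 1.5*w0^2 + 0.5*w0*w1 + 0.5*w1^2 - w0 + 2*w1."""
    return [3.0 * w[0] + 0.5 * w[1] - 1.0,
            0.5 * w[0] + 1.0 * w[1] + 2.0]

def scaled_gradient_step(w, alpha, lr):
    """Update w directly, scaling each gradient coordinate by alpha_i**2."""
    g = grad_f(w)
    return [wi - lr * ai**2 * gi for wi, ai, gi in zip(w, alpha, g)]

def sgd_step_on_scaled_params(w, alpha, lr):
    """Assume w = alpha * theta, take a plain gradient step on theta, map back to w."""
    theta = [wi / ai for wi, ai in zip(w, alpha)]
    # Chain rule: d f(alpha * theta) / d theta_i = alpha_i * (d f / d w_i).
    g_w = grad_f([ai * ti for ai, ti in zip(alpha, theta)])
    g_theta = [ai * gi for ai, gi in zip(alpha, g_w)]
    theta_new = [ti - lr * gi for ti, gi in zip(theta, g_theta)]
    return [ai * ti for ai, ti in zip(alpha, theta_new)]

w0 = [0.7, -1.3]      # arbitrary starting point (assumption)
alpha = [0.5, 2.0]    # fixed per-coordinate scales (assumption)
lr = 0.1

w_a = scaled_gradient_step(w0, alpha, lr)
w_b = sgd_step_on_scaled_params(w0, alpha, lr)
match = all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12) for a, b in zip(w_a, w_b))
print("updates match:", match)
```

Because the two updates coincide, a step-size rule designed for plain gradient descent can in principle be applied in the $\theta$ space and then mapped back to $w$. In an adaptive optimizer the scales change over iterations, which this fixed-`alpha` sketch does not model.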
