Towards Simple and Provable Parameter-Free Adaptive Gradient Methods

Yuanzhe Tao; Huizhuo Yuan; Xun Zhou; Yuan Cao; Quanquan Gu

TL;DR

The paper proves that AdaGrad++ achieves convergence rates comparable to those of AdaGrad in convex optimization without predefined learning rate assumptions, and that Adam++ matches the convergence rate of Adam without relying on any conditions on the learning rates.

Abstract

Optimization algorithms such as AdaGrad and Adam have significantly advanced the training of deep models by dynamically adjusting the learning rate during the optimization process. However, ad hoc tuning of learning rates poses a challenge, leading to inefficiencies in practice. To address this issue, recent research has focused on developing "learning-rate-free" or "parameter-free" algorithms that operate effectively without the need for learning rate tuning. Despite these efforts, existing parameter-free variants of AdaGrad and Adam tend to be overly complex and/or lack formal convergence guarantees. In this paper, we present AdaGrad++ and Adam++, novel and simple parameter-free variants of AdaGrad and Adam with convergence guarantees. We prove that AdaGrad++ achieves convergence rates comparable to those of AdaGrad in convex optimization without predefined learning rate assumptions. Similarly, Adam++ matches the convergence rate of Adam without relying on any conditions on the learning rates. Experimental results across various deep learning tasks validate the competitive performance of AdaGrad++ and Adam++.

Paper Structure (26 sections, 7 theorems, 50 equations, 16 figures, 3 tables, 2 algorithms)

Introduction
Related Work
Review of existing methods and preview of proposed methods
AdaGrad++: a parameter-free version of AdaGrad
Algorithm
Convergence Guarantee
Adam++: a parameter-free version of Adam
Algorithm
Convergence Guarantee of Adam++
Experiments
Image Classification
Large Language Model (LLM) Pretraining
Ablation Study
Base learning rate
Conclusions
...and 11 more sections

Key Result

Theorem 4.2

Let $\mathbf{x}_0,\ldots, \mathbf{x}_T$ be the iterates of AdaGrad++. Further let $\tau\in \text{arg}\max_{t\leq T}\sum_{i=0}^{t-1}\frac{\eta_i}{\eta_t}$ and define $\overline{\mathbf{x}}_{\tau}=\frac{\sum_{t=0}^{\tau-1}\eta_t\mathbf{x}_t}{\sum_{t=0}^{\tau-1}\eta_t}$. Then under Assumption asp:gradi where $D_{\tau}=\max_{t\leq \tau}\|\mathbf{x}_{t} - \mathbf{x}^*\|_{\infty}$, $\overline{D}_{\tau}=
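The index $\tau$ and the weighted average $\overline{\mathbf{x}}_{\tau}$ defined in Theorem 4.2 can be computed directly from a recorded run. The sketch below is only an illustration of those two definitions: the function name `select_tau_and_average` and the input layout are hypothetical, and the step sizes `etas` are taken as given positive numbers rather than produced by AdaGrad++ itself. Since the term for $t=0$ is an empty sum, the search runs over $t=1,\ldots,T$.

```python
from itertools import accumulate


def select_tau_and_average(etas, iterates):
    """Return (tau, x_bar) following the definitions in Theorem 4.2.

    etas:     [eta_0, ..., eta_T], positive step sizes (assumed given).
    iterates: [x_0, ..., x_T], each a list of d floats.
    Hypothetical helper for illustration; not the authors' code.
    """
    if len(etas) < 2 or len(etas) != len(iterates):
        raise ValueError("need matching etas/iterates with T >= 1")
    if any(eta <= 0 for eta in etas):
        raise ValueError("step sizes must be positive")

    # prefix[k] = eta_0 + ... + eta_k
    prefix = list(accumulate(etas))

    # tau maximizes sum_{i=0}^{t-1} eta_i / eta_t over t = 1, ..., T.
    # Ties are broken by taking the smallest maximizer.
    tau = max(range(1, len(etas)), key=lambda t: prefix[t - 1] / etas[t])

    # x_bar = sum_{t<tau} eta_t * x_t / sum_{t<tau} eta_t, coordinate-wise.
    total = prefix[tau - 1]
    dim = len(iterates[0])
    x_bar = [
        sum(etas[t] * iterates[t][j] for t in range(tau)) / total
        for j in range(dim)
    ]
    return tau, x_bar
```

For example, with `etas = [1, 2, 4, 4]` the ratios for $t=1,2,3$ are $1/2$, $3/4$, and $7/4$, so the sketch selects $\tau=3$ and averages $\mathbf{x}_0,\mathbf{x}_1,\mathbf{x}_2$ with weights $1,2,4$.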

Figures (16)

Figure 1: The results of training ResNet-18, ResNet-50, and VGG16 on CIFAR-10 with a constant learning rate schedule. Each curve represents the mean of 8 random runs, with the shaded area indicating the standard error. The first row presents the test accuracy of different algorithms, and the second row shows the training losses. Adam++ achieves performance superior or comparable to Adam.
Figure 2: The results of training ResNet-18, ResNet-50, and VGG16 on CIFAR-10 with a cosine learning rate schedule. Each curve represents the mean of 8 random runs, with the shaded area indicating the standard error. The first row presents the test accuracy of different algorithms, and the second row shows the training losses.
Figure 3: Comparison of training GPT-2 Small (155M) on OpenWebText. Left: Test loss. Performance at 50k steps: AdamW 3.00, D-Adapt AdamW 3.01, Prodigy 3.01, AdamW++ 2.98. Right: Train loss. Performance at 50k steps: AdamW 2.97, D-Adapt AdamW 2.97, Prodigy 2.98, AdamW++ 2.95. AdamW++ denotes the Case 2 variant of AdamW++.
Figure 4: Comparison of training GPT-2 Medium (355M) on OpenWebText. Left: Test loss. Performance at 50k steps: AdamW 2.80, D-Adapt AdamW 2.87, Prodigy 2.80, AdamW++ 2.78. Right: Train loss. Performance at 50k steps: AdamW 2.75, D-Adapt AdamW 2.82, Prodigy 2.75, AdamW++ 2.73. AdamW++ denotes the Case 2 variant of AdamW++.
Figure 5: Effect of different choices of $\eta_0$ on test accuracy and training losses. When $\eta_0$ is less than $10^{-1}$, its influence on final performance is marginal.
...and 11 more figures
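Figures 1 and 2 plot the mean over 8 random runs with a shaded band for the standard error. As a rough sketch of how such a curve and band might be computed, the snippet below takes one list of per-step values per run. The function name `mean_and_standard_error` is hypothetical, and the use of the sample standard deviation (divisor $n-1$) is an assumption, since the captions do not state which convention was used.

```python
import math
import statistics


def mean_and_standard_error(runs):
    """Per-step mean and standard error across runs.

    runs: list of n curves, each a list of floats of equal length
          (for example, test accuracy per epoch for one random seed).
    Assumption: standard error = sample std (ddof = 1) / sqrt(n).
    """
    n = len(runs)
    if n < 2:
        raise ValueError("need at least two runs")
    if len({len(curve) for curve in runs}) != 1:
        raise ValueError("all runs must have the same length")

    means, errors = [], []
    # zip(*runs) yields, for each step, the tuple of values across runs.
    for values in zip(*runs):
        means.append(statistics.fmean(values))
        errors.append(statistics.stdev(values) / math.sqrt(n))
    return means, errors
```

The shaded area in a plot would then span `means[k] - errors[k]` to `means[k] + errors[k]` at each step `k`.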

Theorems & Definitions (7)

Theorem 4.2
Corollary 4.3
Corollary 4.4
Theorem 5.1
Corollary 5.2
Lemma C.1
Lemma C.2
