Towards Simple and Provable Parameter-Free Adaptive Gradient Methods

Yuanzhe Tao, Huizhuo Yuan, Xun Zhou, Yuan Cao, Quanquan Gu

TL;DR

The paper proves that AdaGrad++ achieves convergence rates comparable to those of AdaGrad in convex optimization without predefined learning rate assumptions, and that Adam++ matches the convergence rate of Adam without relying on any conditions on the learning rates.

Abstract

Optimization algorithms such as AdaGrad and Adam have significantly advanced the training of deep models by dynamically adjusting the learning rate during the optimization process. However, ad hoc tuning of learning rates poses a challenge, leading to inefficiencies in practice. To address this issue, recent research has focused on developing "learning-rate-free" or "parameter-free" algorithms that operate effectively without the need for learning rate tuning. Despite these efforts, existing parameter-free variants of AdaGrad and Adam tend to be overly complex and/or lack formal convergence guarantees. In this paper, we present AdaGrad++ and Adam++, novel and simple parameter-free variants of AdaGrad and Adam with convergence guarantees. We prove that AdaGrad++ achieves comparable convergence rates to AdaGrad in convex optimization without predefined learning rate assumptions. Similarly, Adam++ matches the convergence rate of Adam without relying on any conditions on the learning rates. Experimental results across various deep learning tasks validate the competitive performance of AdaGrad++ and Adam++.
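To make the tuning problem concrete, the following is a minimal sketch of standard (diagonal) AdaGrad, not of AdaGrad++, whose update rule is not given in this summary. The function name `adagrad`, the toy quadratic objective, and the two learning rates are illustrative assumptions. The sketch shows where the hand-chosen base learning rate `lr` enters the update, which is the quantity parameter-free methods aim to remove.

```python
import numpy as np

def adagrad(grad_fn, x0, lr, steps, eps=1e-8):
    """Standard diagonal AdaGrad (baseline, NOT the paper's AdaGrad++)."""
    x = np.asarray(x0, dtype=float).copy()
    s = np.zeros_like(x)                   # running sum of squared gradients
    for _ in range(steps):
        g = grad_fn(x)
        s += g * g
        x -= lr * g / (np.sqrt(s) + eps)   # per-coordinate step, scaled by lr
    return x

# Toy convex objective (an assumption for illustration):
# f(x) = 0.5 * ||x - target||^2, so grad f(x) = x - target.
target = np.array([1.0, -2.0])
grad_fn = lambda x: x - target

x_small = adagrad(grad_fn, np.zeros(2), lr=0.01, steps=200)  # lr too small
x_good = adagrad(grad_fn, np.zeros(2), lr=1.0, steps=200)    # lr well matched

err_small = np.linalg.norm(x_small - target)
err_good = np.linalg.norm(x_good - target)
print(err_small, err_good)
```

On this toy problem the run with the smaller `lr` stays far from the minimizer after the same number of steps, even though the per-coordinate scaling is adaptive. The base learning rate still has to match the problem scale.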

Paper Structure (26 sections, 7 theorems, 50 equations, 16 figures, 3 tables, 2 algorithms)


Key Result

Theorem 4.2

Let $\mathbf{x}_0,\ldots, \mathbf{x}_T$ be the iterates of AdaGrad++. Further let $\tau\in \arg\max_{t\leq T}\sum_{i=0}^{t-1}\frac{\eta_i}{\eta_t}$ and define $\overline{\mathbf{x}}_{\tau}=\frac{\sum_{t=0}^{\tau-1}\eta_t\mathbf{x}_t}{\sum_{t=0}^{\tau-1}\eta_t}$. Then under Assumption asp:gradi where $D_{\tau}=\max_{t\leq \tau}\|\mathbf{x}_{t} - \mathbf{x}^*\|_{\infty}$, $\overline{D}_{\tau}=
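The two quantities defined in the theorem statement, the index $\tau$ and the weighted average $\overline{\mathbf{x}}_{\tau}$, can be computed from a recorded run as in the sketch below. It assumes the step sizes $\eta_0,\ldots,\eta_T$ and iterates $\mathbf{x}_0,\ldots,\mathbf{x}_T$ are already available as arrays. The function names `select_tau` and `weighted_average` and the toy numbers are hypothetical and do not come from the paper. How AdaGrad++ produces $\eta_t$ is not covered here.

```python
import numpy as np

def select_tau(etas):
    """Return an index tau in argmax_{t <= T} sum_{i < t} eta_i / eta_t.

    `etas` holds eta_0, ..., eta_T. Ties are broken by the smallest index.
    """
    etas = np.asarray(etas, dtype=float)
    if etas.ndim != 1 or etas.size < 2 or np.any(etas <= 0):
        raise ValueError("need at least two strictly positive step sizes")
    prefix = np.cumsum(etas) - etas        # prefix[t] = sum_{i < t} eta_i
    return int(np.argmax(prefix / etas))

def weighted_average(xs, etas, tau):
    """Return sum_{t < tau} eta_t x_t / sum_{t < tau} eta_t."""
    xs = np.asarray(xs, dtype=float)       # shape (T + 1, d)
    w = np.asarray(etas, dtype=float)[:tau]
    return (w[:, None] * xs[:tau]).sum(axis=0) / w.sum()

# Toy values chosen only to exercise the formulas (not results from the paper).
etas = [1.0, 2.0, 2.0, 4.0]
xs = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]

tau = select_tau(etas)
x_bar = weighted_average(xs, etas, tau)
print(tau, x_bar)
```

Because the ratio at $t=0$ is an empty sum, any run with at least two positive step sizes gives $\tau\geq 1$, so the denominator of the average is never zero.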

Figures (16)

  • Figure 1: The results of training ResNet-18, ResNet-50, and VGG16 on CIFAR-10 with a constant learning rate schedule. Each curve represents the mean of 8 random runs, with the shaded area indicating the standard error. The first row presents the test accuracy of different algorithms, and the second row shows the training losses. Adam++ achieves performance superior or comparable to Adam.
  • Figure 2: The results of training ResNet-18, ResNet-50, and VGG16 on CIFAR-10 with a cosine learning rate schedule. Each curve represents the mean of 8 random runs, with the shaded area indicating the standard error. The first row presents the test accuracy of different algorithms, and the second row shows the training losses.
  • Figure 3: Comparison of training GPT-2 Small (155M) on OpenWebText. Left: Test loss. Performance at 50k steps: AdamW 3.00, D-Adapt AdamW 3.01, Prodigy 3.01, AdamW++ 2.98. Right: Train loss. Performance at 50k steps: AdamW 2.97, D-Adapt AdamW 2.97, Prodigy 2.98, AdamW++ 2.95. AdamW++ refers to AdamW++ (Case 2).
  • Figure 4: Comparison of training GPT-2 Medium (355M) on OpenWebText. Left: Test loss. Performance at 50k steps: AdamW 2.80, D-Adapt AdamW 2.87, Prodigy 2.80, AdamW++ 2.78. Right: Train loss. Performance at 50k steps: AdamW 2.75, D-Adapt AdamW 2.82, Prodigy 2.75, AdamW++ 2.73. AdamW++ refers to AdamW++ (Case 2).
  • Figure 5: Effect of different choices of $\eta_0$ on test accuracy and training losses. When $\eta_0$ is less than $10^{-1}$, its influence on final performance is marginal.
  • ...and 11 more figures
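The captions for Figures 1 and 2 describe each curve as the mean over 8 random runs with a shaded standard-error band. A minimal sketch of that aggregation follows. It assumes the sample standard deviation (`ddof=1`) divided by the square root of the number of runs, which is a common convention but is not confirmed by the captions. The function name `mean_and_stderr` and the two-run toy input are hypothetical and are not data from the paper.

```python
import numpy as np

def mean_and_stderr(curves):
    """Aggregate per-run curves into a mean curve and a standard-error band.

    `curves` has shape (n_runs, n_steps). The use of ddof=1 is an assumption.
    """
    curves = np.asarray(curves, dtype=float)
    n_runs = curves.shape[0]
    mean = curves.mean(axis=0)
    stderr = curves.std(axis=0, ddof=1) / np.sqrt(n_runs)
    return mean, stderr

# Two made-up runs with two logged steps each, only to show the shapes.
toy_curves = [[1.0, 2.0],
              [3.0, 4.0]]
mean, stderr = mean_and_stderr(toy_curves)
print(mean, stderr)
```

The shaded region in a plot would then span `mean - stderr` to `mean + stderr` at each logged step.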

Theorems & Definitions (7)

  • Theorem 4.2
  • Corollary 4.3
  • Corollary 4.4
  • Theorem 5.1
  • Corollary 5.2
  • Lemma C.1
  • Lemma C.2