Prodigy: An Expeditiously Adaptive Parameter-Free Learner

Konstantin Mishchenko, Aaron Defazio

TL;DR

The paper tackles the problem of tuning learning rates in adaptive optimizers by introducing Prodigy, a parameter-free learner that estimates the distance to the solution $D$ using AdaGrad-like step sizes. Prodigy modifies D-Adaptation to yield faster non-asymptotic convergence, establishes a lower-bound framework for exponentially bounded algorithms, and derives Adam-like step-size variants to extend applicability. In theory, Prodigy improves on the convergence rate of D-Adaptation by a factor of $O(\sqrt{\log(D/d_0)})$, where $d_0$ is the initial estimate of $D$. Experiments across logistic regression, CIFAR-10, transformers, LSTM, and other large models show consistent gains over D-Adaptation and test accuracy close to that of hand-tuned Adam. The approach is practical and has already been widely adopted in real-world training pipelines, including Hugging Face Diffusers and LoRA-based workflows.

Abstract

We consider the problem of estimating the learning rate in adaptive methods, such as AdaGrad and Adam. We propose Prodigy, an algorithm that provably estimates the distance to the solution $D$, which is needed to set the learning rate optimally. At its core, Prodigy is a modification of the D-Adaptation method for learning-rate-free learning. It improves upon the convergence rate of D-Adaptation by a factor of $O(\sqrt{\log(D/d_0)})$, where $d_0$ is the initial estimate of $D$. We test Prodigy on 12 common logistic-regression benchmark datasets, VGG11 and ResNet-50 training on CIFAR10, ViT training on Imagenet, LSTM training on IWSLT14, DLRM training on Criteo dataset, VarNet on Knee MRI dataset, as well as RoBERTa and GPT transformer training on BookWiki. Our experimental results show that our approach consistently outperforms D-Adaptation and reaches test accuracy values close to that of hand-tuned Adam.
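The idea of estimating $D$ on the fly can be illustrated with a small sketch. The code below is not the paper's algorithm, which this summary does not reproduce. It is a minimal illustration that assumes an AdaGrad-Norm-style step size scaled by the running estimate $d_k$, together with the lower bound $\hat d_{k+1} = \sum_{i \le k} \eta_i \langle g_i, x_0 - x_i \rangle / \|x_{k+1} - x_0\|$, which convexity guarantees never exceeds $D = \|x_0 - x_*\|$. The function name `distance_estimating_gd`, its parameters, and the quadratic test problem are all hypothetical.

```python
import numpy as np

def distance_estimating_gd(grad, x0, d0=1e-6, G=0.0, steps=200):
    """Gradient descent whose step size is scaled by a running lower bound
    d_k on D = ||x0 - x*||. Illustrative sketch, not the paper's algorithm."""
    x0 = np.asarray(x0, dtype=float)
    x = x0.copy()
    d = d0
    weighted_sq = 0.0   # running sum of d_i^2 * ||g_i||^2
    numerator = 0.0     # running sum of eta_i * <g_i, x0 - x_i>
    history = [d]
    for _ in range(steps):
        g = grad(x)
        weighted_sq += d**2 * float(g @ g)
        denom = np.sqrt(d**2 * G**2 + weighted_sq)
        if denom == 0.0:            # zero gradient so far: nothing to do
            break
        eta = d**2 / denom          # assumed AdaGrad-like step size
        numerator += eta * float(g @ (x0 - x))
        x = x - eta * g
        dist = np.linalg.norm(x - x0)
        if dist > 0.0:
            # For convex f, numerator / dist is a lower bound on D,
            # so taking the max keeps d_k nondecreasing and at most D.
            d = max(d, numerator / dist)
        history.append(d)
    return x, np.array(history)

# Hypothetical test problem: f(x) = 0.5 * ||x - x_star||^2, so D = 5.
x_star = np.array([3.0, 4.0])
x_final, d_hist = distance_estimating_gd(lambda x: x - x_star, np.zeros(2))
print(d_hist[0], d_hist[-1])
```

Starting from a deliberately tiny `d0`, the estimate can only grow, and on a convex problem it stays below the true distance. How quickly it approaches $D$ is what the $O(\sqrt{\log(D/d_0)})$ improvement in the abstract is about.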

Paper Structure

This paper contains 20 sections, 22 theorems, 110 equations, 6 figures, and 4 algorithms.

Key Result

Theorem 1

Assume $f$ is convex and $G$-Lipschitz. Given any weights $1\le\lambda_0\le\dotsb\le\lambda_n$, the functional gap of the average iterate of Algorithm alg:dadagradv2gd converges (the explicit bound is missing from this extract). Here $\hat{x}_n = \frac{1}{n+1}\sum_{k=0}^n \eta_k x_k$ is the weighted average iterate.
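As a small illustration of the averaged iterate in the statement, the sketch below computes $\hat{x}_n$ exactly as written, with weights $\eta_k$ and normalization by $n+1$ (not by $\sum_k \eta_k$). The helper name `weighted_average_iterate` and the sample values are hypothetical.

```python
import numpy as np

def weighted_average_iterate(iterates, etas):
    """Return (1 / (n + 1)) * sum_k eta_k * x_k for iterates x_0..x_n."""
    iterates = np.asarray(iterates, dtype=float)  # shape (n + 1, dim)
    etas = np.asarray(etas, dtype=float)          # shape (n + 1,)
    if iterates.shape[0] != etas.shape[0]:
        raise ValueError("need one step size per iterate")
    return (etas[:, None] * iterates).sum(axis=0) / len(etas)

# Three made-up iterates in 2-D with made-up step sizes.
x_hat = weighted_average_iterate(
    [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]],
    [0.5, 0.5, 2.0],
)
print(x_hat)
```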

Figures (6)

  • Figure 1: Convex multiclass classification problems. Error bars show a range of 1 standard error above and below the mean of the 10 seeds.
  • Figure 2: VGG11 and ResNet-50 training on CIFAR10. Left: test accuracy (%), middle: train loss, right: step sizes. "Prodigy" is used as given in Algorithm alg:prodigy_adam. As expected, Prodigy estimates a larger step size than D-Adaptation, which helps it reach test accuracy closer to that of Adam.
  • Figure 3: The test (left) and train (middle) loss curves as well as the estimated stepsize (right) when training a 6-layer nanoGPT transformer on the Shakespeare dataset.
  • Figure 4: Adam-family experiments.
  • Figure 5: Adam-family experiments.
  • ...and 1 more figure

Theorems & Definitions (41)

  • Theorem 1
  • Lemma 1
  • Theorem 2
  • Definition 1
  • Theorem 3
  • Theorem 4
  • Proposition 1: Lemma A.2 in levy2018online
  • proof
  • Proposition 2
  • proof
  • ...and 31 more