Learning-Rate-Free Learning by D-Adaptation

Aaron Defazio; Konstantin Mishchenko

TL;DR

This work introduces D-Adaptation, a learning-rate-free approach for convex Lipschitz minimization. It maintains a data-driven lower bound on the distance to the solution and uses it to achieve the optimal $DG/\sqrt{n}$ convergence rate asymptotically, without back-tracking or extra function or gradient evaluations. The method combines dual averaging (DA) with adaptive distance estimation and extends to AdaGrad and Adam variants. Empirically, it often matches hand-tuned learning rates on convex problems and on diverse deep-learning tasks without a hyper-parameter search. Theoretical contributions include the core dual averaging bounds, asymptotic and non-asymptotic analyses, and coordinate-wise extensions; the paper also provides practical guidance and an open-source implementation.
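
To make the "data-driven lower bound" concrete, here is a sketch of how such a bound can arise. It assumes a weighted dual averaging form that this summary does not spell out: subgradients $g_k \in \partial f(x_k)$, weights $d_k$, the weighted sum $s_{n+1} = \sum_{k=0}^{n} d_k g_k$ with $s_0 = 0$, a non-increasing step sequence $\gamma_k$, and iterates $x_k = x_0 - \gamma_k s_k$. The symbols $s$, $\gamma$ and $\hat{d}$ are notation introduced here for illustration. Under these assumptions, convexity gives $\sum_{k=0}^{n} d_k \langle g_k, x_k - x_* \rangle \geq 0$; splitting $x_k - x_* = (x_0 - x_*) + (x_k - x_0)$ and bounding the first part by $\Vert s_{n+1} \Vert D$ yields

$$
D \;\geq\; \hat{d}_{n+1} \;=\; \frac{\gamma_{n+1} \Vert s_{n+1} \Vert^{2} - \sum_{k=0}^{n} \gamma_k d_k^{2} \Vert g_k \Vert^{2}}{2 \Vert s_{n+1} \Vert}.
$$

Setting $d_{n+1} = \max(d_n, \hat{d}_{n+1})$ then keeps the estimate non-decreasing and, provided $d_0 \leq D$, never above $D$. Every quantity on the right-hand side is already computed during the run, which is why no extra function or gradient evaluations are needed.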

Abstract

D-Adaptation is an approach to automatically setting the learning rate which asymptotically achieves the optimal rate of convergence for minimizing convex Lipschitz functions, with no back-tracking or line searches, and no additional function value or gradient evaluations per step. Our approach is the first hyper-parameter free method for this class without additional multiplicative log factors in the convergence rate. We present extensive experiments for SGD and Adam variants of our method, where the method automatically matches hand-tuned learning rates across more than a dozen diverse machine learning problems, including large-scale vision and language problems. An open-source implementation is available.

Paper Structure (40 sections, 29 theorems, 169 equations, 10 figures, 13 tables, 5 algorithms)

Introduction
Algorithm
Why Dual Averaging?
D-Adapted AdaGrad
Discussion
Different ways to estimate D
Limitations
Related Work
Polyak step size
Exact line searches
Bisection
DoG
Coin-betting
Reward Doubling
Machine Learning Applications
...and 25 more sections

Key Result

Theorem 1

For a convex $G$-Lipschitz function $f$, Algorithm 1 returns a point $\hat{x}_{n}$ such that $f(\hat{x}_{n}) - f(x_{*}) = \mathcal{O}\left(\frac{DG}{\sqrt{n+1}}\right)$ as $n \rightarrow \infty$, where $D=\left\Vert x_{0}-x_{*}\right\Vert$ for any $x_{*}$ in the set of minimizers of $f$, as long as $d_0\leq D$.
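
The following is a minimal Python sketch of dual averaging with a running distance estimate, run on the toy problem $f(x)=|x|$ with $x_0 = 1.0$ (so $D = 1$). It is not the authors' reference implementation. The function name `d_adapted_dual_averaging`, the AdaGrad-norm step size `gamma`, the exact form of the estimator `d_hat`, and the choice `d0=0.1` are all assumptions made for illustration.

```python
import numpy as np

def d_adapted_dual_averaging(subgrad, x0, d0, n_steps):
    """Illustrative sketch: dual averaging with a lower-bound estimate d of D."""
    x0 = np.atleast_1d(np.asarray(x0, dtype=float))
    x = x0.copy()
    s = np.zeros_like(x0)            # weighted subgradient sum s_k
    d = float(d0)                    # current estimate d_k (assumed d0 <= D)
    g0_norm = np.linalg.norm(subgrad(x0))
    if g0_norm == 0.0:               # x0 is already a minimizer
        return x0, [d]
    gamma = 1.0 / g0_norm            # gamma_0 (assumed AdaGrad-norm step size)
    sum_g2 = 0.0                     # running sum of ||g_i||^2
    penalty = 0.0                    # running sum of gamma_i * d_i^2 * ||g_i||^2
    weighted_x = np.zeros_like(x0)   # accumulates d_k * x_k for the average
    weight = 0.0
    d_hist = [d]
    for _ in range(n_steps):
        g = subgrad(x)
        g2 = float(g @ g)
        weighted_x += d * x
        weight += d
        penalty += gamma * d**2 * g2   # uses gamma_k and d_k
        s = s + d * g
        sum_g2 += g2
        gamma = 1.0 / np.sqrt(sum_g2)  # gamma_{k+1}
        s_norm = np.linalg.norm(s)
        if s_norm > 0.0:
            # Candidate lower bound on D; keep the estimate non-decreasing.
            d_hat = (gamma * s_norm**2 - penalty) / (2.0 * s_norm)
            d = max(d, d_hat)
        x = x0 - gamma * s             # dual averaging step from x0
        d_hist.append(d)
    return weighted_x / weight, d_hist

# Toy problem f(x) = |x|; np.sign is a valid subgradient of |x|.
x_hat, d_hist = d_adapted_dual_averaging(np.sign, x0=1.0, d0=0.1, n_steps=2000)
print("f(x_hat) =", abs(x_hat[0]), " final d =", d_hist[-1])
```

The returned point is the $d_k$-weighted average of the iterates, and `d_hist` records how the distance estimate grows from `d0` while staying below $D$. A smaller `d0` is a safer default when $D$ is unknown, at the cost of a longer warm-up before the estimate becomes useful.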

Figures (10)

Figure 1: Toy problem illustrating the estimate of $D$ over time, $f(x)=|x|$. $x_0=1.0$ is shown as a blue dot on the left plot, and the following iterates are shown in purple.
Figure 2: SGD with D-Adaptation
Figure 3: Adam with D-Adaptation
Figure 4: Logistic Regression experiments.
Figure 5: Image Classification experiments.
...and 5 more figures

Theorems & Definitions (29)

Theorem 1
Theorem 2
Theorem 3
Theorem 4
Lemma 5
Lemma 6
Theorem 7
Lemma 8
Proposition 9
Lemma 10
...and 19 more
