Gradient descent with adaptive stepsize converges (nearly) linearly under fourth-order growth

Damek Davis; Dmitriy Drusvyatskiy; Liwei Jiang

TL;DR

This work shows that gradient descent with an adaptive, epoch-based stepsize achieves a local (nearly) linear convergence rate on smooth functions that exhibit only fourth-order growth away from the minimizer, challenging the conventional belief that quadratic growth is required. Central to the approach is the ravine decomposition, which identifies a smooth manifold around the optimum (the ravine) such that the function grows at least quadratically away from it and has constant-order growth along it. The resulting scheme interlaces many short gradient steps, which make rapid progress toward the ravine, with a single long Polyak step. The authors prove a main convergence theorem under a precise Assumption A about the ravine and its tangent behavior, and illustrate the theory on three overparameterized problems: matrix sensing, matrix factorization, and learning a single neuron. The results provide a principled way to accelerate first-order methods in settings with degenerate local geometry, with practical implications for optimization in overparameterized models. Key notions are expressed through $f$, $f^*$, ${\rm dist}(\bullet,\bullet)$, and manifold projections $P_{\rm M}$.

Abstract

A prevalent belief among optimization specialists is that linear convergence of gradient descent is contingent on the function growing quadratically away from its minimizers. In this work, we argue that this belief is inaccurate. We show that gradient descent with an adaptive stepsize converges at a local (nearly) linear rate on any smooth function that merely exhibits fourth-order growth away from its minimizer. The adaptive stepsize we propose arises from an intriguing decomposition theorem: any such function admits a smooth manifold around the optimal solution -- which we call the ravine -- so that the function grows at least quadratically away from the ravine and has constant order growth along it. The ravine allows one to interlace many short gradient steps with a single long Polyak gradient step, which together ensure rapid convergence to the minimizer. We illustrate the theory and algorithm on the problems of matrix sensing and factorization and learning a single neuron in the overparameterized regime.
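To make the decomposition concrete, here is a short worked computation on the function $f(x,y)=x^4+10(y-x^2)^2$ (the example plotted in Figure 1). It is a sketch only: treating the curve $y=x^2$ as the ravine is an assumption made here for illustration, and the symbol $u$ for the vertical offset from that curve is introduced just for this example.

```latex
\begin{align*}
f(x,y) &= x^4 + 10\,(y-x^2)^2, \qquad \min f = f(0,0) = 0,\\
\nabla f(x,y) &= \bigl(4x^3 - 40x\,(y-x^2),\; 20\,(y-x^2)\bigr),\\
\nabla^2 f(0,0) &= \begin{pmatrix} 0 & 0\\ 0 & 20 \end{pmatrix}
  \quad\text{(degenerate, so $f$ has no quadratic growth at the origin)},\\
f(x,x^2) &= x^4
  \quad\text{(fourth-order growth along the curve $y=x^2$)},\\
f(x,x^2+u) - f(x,x^2) &= 10\,u^2
  \quad\text{(quadratic growth in the offset $u$ from the curve)},\\
(x,x^2) - \frac{f(x,x^2)}{\|\nabla f(x,x^2)\|^2}\,\nabla f(x,x^2)
  &= (x,x^2) - \frac{x^4}{16x^6}\,(4x^3,0) = \bigl(\tfrac34 x,\; x^2\bigr).
\end{align*}
```

In this example a Polyak step taken on the curve shrinks the $x$-coordinate by the constant factor $3/4$, but leaves the point off the curve by $x^2-(\tfrac34 x)^2=\tfrac{7}{16}x^2$. That offset is what the short gradient steps between two Polyak steps would have to remove.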

Paper Structure (43 sections, 43 theorems, 216 equations, 5 figures, 2 algorithms)

Introduction
Matrix sensing.
Learning a single neuron.
Related literature.
Ravines, partial smoothness, and local linear convergence.
Overparameterized matrix sensing
Gradient descent with alternating short and long steps.
Notation and preliminaries
Ravines: definition, existence, and examples
The Morse ravine is a ravine
Constant rank, uniform ravines, and lower growth
Ravines: analytic properties
Gradient control in tangent and normal directions.
Orthogonal decomposition of the function
Item \ref{it:strong_a_in_thm} (Projected Gradient):
...and 28 more sections

Key Result

Theorem 1.1

Consider a smooth function $f$ satisfying $f(x)-\inf f\geq \Omega( \|x-\bar{x}\|^4)$ for all $x$ near the minimizer $\bar{x}$. Then, when initialized sufficiently close to $\bar{x}$ with sufficiently small $\eta$, Algorithm \ref{alg:GD-Polyak} reaches any $\varepsilon$-ball around $\bar{x}$ after $O(\log^

Figures (5)

Figure 1: The function $f(x,y)=x^4+10(y-x^2)^2$
Figure 2: Comparison of $\mathtt{GDPolyak}$ with $\mathtt{GD}$ and $\mathtt{Polyak}$ on the Rosenbrock function. $\mathtt{GDPolyak}$ proceeds in $I = 50$ epochs of length $K = 100$. During each epoch, $\mathtt{GDPolyak}$ uses the same short stepsize as $\mathtt{GD}$, i.e., $.0125$. After taking $K=100$ steps with short stepsizes, $\mathtt{GDPolyak}$ takes a step with the Polyak stepsize $\frac{f(x) - f^\ast}{\|\nabla f(x)\|^2}$.
Figure 3: Comparison of $\mathtt{GDPolyak}$ with $\mathtt{GD}$ and $\mathtt{Polyak}$ on an overparameterized quadratic matrix sensing problem. Each measurement matrix is of the form $A_i = a_ia_i^T - \tilde{a}_i \tilde{a}_i^T$ where $a_i$ and $\tilde{a}_i$ are $d$-dimensional standard Gaussians. In this experiment, $d=100$, the unknown rank is $r = 2$, and the overparameterized rank is $k = 4$. For $\mathtt{GDPolyak}$, we run the method for $I=50$ epochs of size $K = 300$. In each epoch, $\mathtt{GDPolyak}$ uses constant stepsize $.05$.
Figure 4: Comparison of $\mathtt{GDPolyak}$ with $\mathtt{GD}$ and $\mathtt{Polyak}$ for the problem of learning a single neuron in the overparameterized regime. In the experiment, we set $d=100$ and $n = 2$. For $\mathtt{GDPolyak}$, we run the method for $I=50$ epochs of size $K = 100$. In each epoch, $\mathtt{GDPolyak}$ uses constant stepsize $1.5$. Since it is difficult to compute the exact distance to the set of minimizers $S$ (defined explicitly in \ref{eq:solutionsetdu}), we instead compute a penalty, which can be shown to be proportional to ${\rm dist}(x_k, S)$.
Figure 5: The function $f(z)=(\|z\|-1)^2+\left\|\tfrac{z}{\|z\|}-e_2\right\|^4$ for $z\in {\mathbb R}^2$.
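The epoch structure described in the Figure 2 caption (a block of $K$ short gradient steps followed by one step with the Polyak stepsize $\frac{f(x) - f^\ast}{\|\nabla f(x)\|^2}$) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It is run on the Figure 1 function $f(x,y)=x^4+10(y-x^2)^2$ rather than on the Rosenbrock function, it assumes the optimal value $f^\ast=0$ is known, and it borrows $I=50$, $K=100$, and the short stepsize $.0125$ from the Figure 2 caption. The starting point and all function names (`gd_polyak`, `plain_gd`, and so on) are hypothetical choices made for this example.

```python
import math

def f(z):
    # Figure 1 function: f(x, y) = x^4 + 10 (y - x^2)^2, minimized at (0, 0).
    x, y = z
    return x**4 + 10.0 * (y - x**2) ** 2

def grad_f(z):
    x, y = z
    u = y - x**2  # vertical offset from the curve y = x^2
    return (4.0 * x**3 - 40.0 * x * u, 20.0 * u)

def gd_step(z, step):
    g = grad_f(z)
    return (z[0] - step * g[0], z[1] - step * g[1])

def polyak_step(z, f_star=0.0):
    # One gradient step with stepsize (f(z) - f*) / ||grad f(z)||^2.
    g = grad_f(z)
    g_sq = g[0] ** 2 + g[1] ** 2
    if g_sq == 0.0:  # already stationary; nothing to do
        return z
    return gd_step(z, (f(z) - f_star) / g_sq)

def gd_polyak(z0, short_step, epochs, epoch_len):
    # Each epoch: epoch_len short steps, then a single long Polyak step.
    z = z0
    for _ in range(epochs):
        for _ in range(epoch_len):
            z = gd_step(z, short_step)
        z = polyak_step(z)
    return z

def plain_gd(z0, step, num_steps):
    z = z0
    for _ in range(num_steps):
        z = gd_step(z, step)
    return z

z0 = (0.5, 0.0)                 # assumed starting point near the minimizer
eta, I, K = 0.0125, 50, 100     # values borrowed from the Figure 2 caption

z_gdp = gd_polyak(z0, eta, I, K)
z_gd = plain_gd(z0, eta, I * (K + 1))  # same total number of gradient steps

dist_gdp = math.hypot(*z_gdp)   # distance to the minimizer (0, 0)
dist_gd = math.hypot(*z_gd)
print("GDPolyak distance:", dist_gdp)
print("GD distance:      ", dist_gd)
```

Both runs use the same number of gradient evaluations, so the two printed distances give a like-for-like comparison on this toy problem. The epoch length is kept fixed here for simplicity.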

Theorems & Definitions (79)

Theorem 1.1: informal
Proposition 2.1: Range
Proposition 2.2: Tangents
Definition 2.3: Retraction
Proposition 2.4: Retractions as approximate projections
proof
Definition 3.1: Ravine
Definition 3.2: Morse ravine
Lemma 3.3: Smoothness of the Morse ravine
proof
...and 69 more
