Acceleration by Stepsize Hedging I: Multi-Step Descent and the Silver Stepsize Schedule

Jason M. Altschuler; Pablo A. Parrilo

TL;DR

The paper shows that gradient descent (GD) can be accelerated without momentum by using carefully designed time-varying, non-monotone stepsizes, and introduces the Silver Stepsize Schedule for this purpose. The schedule yields a convergence rate that lies between the textbook unaccelerated rate and Nesterov-style acceleration, with a phase transition at horizon $n^* = \Theta(\kappa^{\log_{\rho} 2})$, where $\rho = 1+\sqrt{2}$ is the silver ratio, and an overall iteration complexity of $n = \Theta(\kappa^{\log_{\rho} 2} \log(1/\varepsilon))$ to reach accuracy $\varepsilon$ for $\kappa$-conditioned functions. The authors conjecture, and give partial evidence, that this rate is optimal among all stepsize schedules; the upper bound is proven via multi-step descent, recursive certificates, and hedging arguments. The stepsize schedule is constructed recursively in a fully explicit way and has a fractal-like structure, and the rate certificate is built from co-coercivity and interpolation tools. The results extend to the non-strongly convex setting via black-box reductions and suggest new directions for acceleration that leave the GD framework unchanged. Overall, the work challenges the long-held belief that acceleration requires momentum by showing that dynamic, well-structured stepsizes alone achieve substantially faster convergence in smooth convex optimization.

Abstract

Can we accelerate convergence of gradient descent without changing the algorithm -- just by carefully choosing stepsizes? Surprisingly, we show that the answer is yes. Our proposed Silver Stepsize Schedule optimizes strongly convex functions in $\kappa^{\log_{\rho} 2} \approx \kappa^{0.7864}$ iterations, where $\rho=1+\sqrt{2}$ is the silver ratio and $\kappa$ is the condition number. This is intermediate between the textbook unaccelerated rate $\kappa$ and the accelerated rate $\sqrt{\kappa}$ due to Nesterov in 1983. The non-strongly convex setting is conceptually identical, and standard black-box reductions imply an analogous accelerated rate $\varepsilon^{-\log_{\rho} 2} \approx \varepsilon^{-0.7864}$. We conjecture and provide partial evidence that these rates are optimal among all possible stepsize schedules. The Silver Stepsize Schedule is constructed recursively in a fully explicit way. It is non-monotonic, fractal-like, and approximately periodic of period $\kappa^{\log_{\rho} 2}$. This leads to a phase transition in the convergence rate: initially super-exponential (acceleration regime), then exponential (saturation regime).
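To make the exponent in the abstract concrete, the following minimal sketch evaluates $\log_{\rho} 2$ numerically and compares the three iteration scalings (unaccelerated, Silver, Nesterov) for the condition numbers that appear in the figures. The function name `iteration_scalings` is hypothetical, and constants and $\log(1/\varepsilon)$ factors are ignored, so the numbers indicate orders of magnitude only.

```python
import math

# Silver ratio rho = 1 + sqrt(2) and the exponent log_rho(2) from the abstract.
RHO = 1 + math.sqrt(2)
EXPONENT = math.log(2) / math.log(RHO)  # approximately 0.7864


def iteration_scalings(kappa):
    """Return the three iteration scalings for condition number kappa.

    Constants and log(1/eps) factors are dropped (illustrative only).
    """
    return {
        "unaccelerated": float(kappa),   # textbook GD: kappa
        "silver": kappa ** EXPONENT,     # Silver schedule: kappa^(log_rho 2)
        "nesterov": math.sqrt(kappa),    # momentum-based: sqrt(kappa)
    }


if __name__ == "__main__":
    print(f"log_rho(2) = {EXPONENT:.4f}")
    for kappa in (4, 16, 64, 256):
        s = iteration_scalings(kappa)
        print(f"kappa={kappa:4d}  unaccelerated={s['unaccelerated']:7.1f}  "
              f"silver={s['silver']:6.1f}  nesterov={s['nesterov']:5.1f}")
```

The `silver` column is also the scale of the period and saturation horizon $n^*$ quoted above, up to constants.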

Paper Structure (34 sections, 14 theorems, 87 equations, 7 figures, 1 table)

Introduction
Contribution and discussion
Main result: acceleration without momentum
Discussion of Silver Convergence Rate
Discussion of Silver Stepsize Schedule
Discussion of problem setting
Related work
The special case of quadratic optimization
The general case of convex optimization
Organization
Conceptual overview: two-step case ($n=2$)
Optimal stepsizes for quadratic optimization
Optimal stepsizes for convex optimization
Silver Stepsize Schedule for $n=2$
Upper bound: rate certification via multi-step descent
...and 19 more sections

Key Result

Theorem 1.1

For any horizon $n \in \mathbb{N}$ that is a power of $2$, any dimension $d$, any $\kappa$-conditioned function $f : \mathbb{R}^d \to \mathbb{R}$, and any initialization $x_0$, where $x^*$ denotes the unique minimizer of $f$, $x_n$ denotes the output of $n$ steps of GD using the Silver Stepsize Schedule (defined in §sec:construction), and $\tau_n$ denotes the $n$-step Silver Convergence Rate (def

Figures (7)

Figure 1: Silver Stepsize Schedule, for different condition numbers $\kappa = 4,16,64,256$ (only the first 64 stepsizes are shown). Notice the recursive, fractal behavior and the approximate periodicity with period of size $n^* = \kappa^{\log_{\rho} 2}$; details are in the section "Discussion of Silver Stepsize Schedule". Also note the different scales on the vertical axis, since the stepsizes are unnormalized, vs. the normalized stepsizes in Figure 5.
Figure 2: Log of the average per-step rate, i.e. $\tfrac{1}{n} \log \tau_n$, for varying condition numbers $\kappa$. The initial value is the unaccelerated rate $(\tfrac{\kappa -1}{\kappa + 1})^2$. Notice the rate saturation phenomenon that occurs at $n=n^* \asymp \kappa^{\log_{\rho} 2}$.
Figure 3: Contour plots of worst-case rates, as a function of the two stepsizes $\alpha$ and $\beta$, for $m=1/4$ and $M=1$. The marked points indicate the global minima. Notice the asymmetry in the convex case (right), due to the non-commutativity of the GD map.
Figure 4: Cobweb plot describing the evolution of $z_n$, under the iteration $z_{n/2} \mapsto z_n$ given in equations (eq:yz-defining) and (eq:yz-recur) of the paper. The initial condition is $1/\kappa$ (in this plot, $\kappa=32$). The iterates grow exponentially when $z$ is near zero, and converge quadratically to $1$ when $z$ is close to $1$.
Figure 5: Normalized Silver Stepsize Schedule, for different condition numbers $\kappa = 4,16,64,256$. Notice that these are always bounded between 0 and 1. The Silver Stepsize Schedules $h^{(n)}$ shown in Figure 1 are generated by applying $\psi$ to the schedules here.
...and 2 more figures
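The quadratic side of Figure 3 can be illustrated numerically. The sketch below is not the Silver Stepsize Schedule; it uses the classical Chebyshev stepsizes, which are the standard optimal choice for quadratics with eigenvalues in $[m, M]$, and compares their two-step worst-case contraction with that of the constant stepsize $2/(M+m)$ for $m=1/4$, $M=1$. The function names are hypothetical, and the contraction is measured on the distance to the minimizer, so it may be normalized differently from the paper's $\tau_n$.

```python
import numpy as np


def chebyshev_stepsizes(m, M, n):
    """Inverse roots of the degree-n Chebyshev polynomial shifted to [m, M]."""
    i = np.arange(1, n + 1)
    roots = (M + m) / 2 + (M - m) / 2 * np.cos((2 * i - 1) * np.pi / (2 * n))
    return 1.0 / roots


def worst_case_contraction(stepsizes, m, M, grid=10001):
    """Max over eigenvalues lam in [m, M] of |prod_t (1 - h_t * lam)|.

    On a quadratic, n steps of GD multiply each eigen-component of the
    error by this product, so the maximum is the worst-case contraction.
    """
    lam = np.linspace(m, M, grid)
    factor = np.ones_like(lam)
    for h in stepsizes:
        factor *= 1.0 - h * lam
    return float(np.max(np.abs(factor)))


m, M = 0.25, 1.0
cheb = worst_case_contraction(chebyshev_stepsizes(m, M, 2), m, M)
const = worst_case_contraction([2 / (M + m)] * 2, m, M)
print(f"Chebyshev stepsizes: {chebyshev_stepsizes(m, M, 2)}")
print(f"two-step contraction, Chebyshev: {cheb:.4f}")
print(f"two-step contraction, constant:  {const:.4f}")
```

In the quadratic case the product is symmetric in the two stepsizes, so their order does not matter; the asymmetry noted in the Figure 3 caption arises only in the general convex case.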

Theorems & Definitions (32)

Theorem 1.1
Remark 2.1
Theorem 2.2: Optimal $2$-step schedule for strongly convex optimization (Theorem 8.11 of [altschuler2018greed])
Proof of rate upper bound for Theorem 2.2
Proof of rate lower bound in Theorem 2.2
Lemma 3.1: Basic properties of the Normalized Silver Stepsizes
Lemma 3.2: Basic properties of the Silver Stepsizes
Remark 3.3: Occupation measure
Theorem 4.1: Silver Convergence Rate
Lemma 4.2: Dynamics in the acceleration regime
...and 22 more
