On Averaging and Extrapolation for Gradient Descent

Alan Luner; Benjamin Grimmer

TL;DR

The paper analyzes how averaging and extrapolation of gradient-descent iterates affect worst-case convergence in smooth convex optimization, using Performance Estimation Problems to obtain exact guarantees. It proves that averaging the iterates cannot improve, and often worsens, worst-case performance for typical gradient-descent stepsize patterns. In contrast, a simple extrapolation $x_\sigma = x_0 + c(x_N - x_0)$ yields a provable improvement matching the benefit of roughly $O(\sqrt{N/\log N})$ extra gradient steps, with the best results obtained for $c$ up to a critical value $c_\mathrm{crit} \approx 1 + \Theta(1/\sqrt{N \log N})$. The results are shown to be tight via dual certificates and Huber function constructions, and numerical experiments extend the insights to other first-order methods and real-world data, highlighting practical gains in memory-constrained settings. Overall, the work clarifies when extrapolation helps, quantifies the achievable gains, and presents extrapolation as a cheap post-processing step for smooth convex optimization.

Abstract

This work considers the effect of averaging, and more generally extrapolation, of the iterates of gradient descent in smooth convex optimization. After running the method, rather than reporting the final iterate, one can report either a convex combination of the iterates (averaging) or a generic combination of the iterates (extrapolation). For several common stepsize sequences, including recently developed accelerated periodically long stepsize schemes, we show averaging cannot improve gradient descent's worst-case performance and is, in fact, strictly worse than simply returning the last iterate. In contrast, we prove a conceptually simple and computationally cheap extrapolation scheme strictly improves the worst-case convergence rate: when initialized at the origin, reporting $(1+1/\sqrt{16N\log(N)})x_N$ rather than $x_N$ improves the best possible worst-case performance by the same amount as conducting $O(\sqrt{N/\log(N)})$ more gradient steps. Our analysis and characterizations of the best-possible convergence guarantees are computer-aided, using performance estimation problems. Numerically, we find similar (small) benefits from such simple extrapolation for a range of gradient methods.
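The three reporting rules compared in the abstract (last iterate, averaged iterate, extrapolated iterate) can be sketched on a single test problem. The following is a minimal illustration, not the paper's experiment: it assumes a one-dimensional Huber function with threshold $\delta = D/(2Nh+1)$ as the test instance, a uniform average as the convex combination, the natural logarithm in the extrapolation factor, and hypothetical helper names (`huber`, `huber_grad`, `gradient_descent`). It shows the ordering on this one function only and says nothing about worst-case guarantees over the whole function class.

```python
import math

def huber(x, x_star, L, delta):
    """Huber function centred at x_star: quadratic within delta, linear outside."""
    r = abs(x - x_star)
    if r <= delta:
        return 0.5 * L * r * r
    return L * delta * r - 0.5 * L * delta * delta

def huber_grad(x, x_star, L, delta):
    """Derivative of the Huber function above."""
    r = x - x_star
    if abs(r) <= delta:
        return L * r
    return L * delta * math.copysign(1.0, r)

def gradient_descent(x0, x_star, L, delta, h, N):
    """Run N steps x_{k+1} = x_k - (h/L) f'(x_k) and return all iterates."""
    xs = [x0]
    for _ in range(N):
        x = xs[-1]
        xs.append(x - (h / L) * huber_grad(x, x_star, L, delta))
    return xs

# Illustrative parameters (assumptions, not taken from the paper's experiments).
N, L, D, h = 10, 1.0, 1.0, 1.0
x0, x_star = 0.0, D                # initialized at the origin, minimizer at distance D
delta = D / (2 * N * h + 1)        # assumed Huber threshold for this instance

xs = gradient_descent(x0, x_star, L, delta, h, N)

# Extrapolation factor from the abstract; natural log is an assumption.
c = 1.0 + 1.0 / math.sqrt(16 * N * math.log(N))

x_last = xs[-1]                    # report the final iterate
x_avg = sum(xs) / len(xs)          # report a uniform average (one convex combination)
x_ext = x0 + c * (x_last - x0)     # report the extrapolated point

gap_last = huber(x_last, x_star, L, delta)
gap_avg = huber(x_avg, x_star, L, delta)
gap_ext = huber(x_ext, x_star, L, delta)

print(f"last iterate gap : {gap_last:.5f}")
print(f"averaged gap     : {gap_avg:.5f}")
print(f"extrapolated gap : {gap_ext:.5f}")
```

On this instance the final iterate's objective gap equals $LD^2/(4Nh+2)$, and the extrapolated point lands closer to the minimizer than the last iterate, while the average lands farther away. A different function (for example a quadratic) can behave differently, which is why the paper's guarantees are stated in the worst case.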

Paper Structure (34 sections, 16 theorems, 118 equations, 8 figures, 5 tables)

Introduction
Our Contributions
Outline.
Preliminaries and Performance Estimation Problems
Performance Estimation Problems with Averaging/Extrapolation
Averaging is Strictly Worse than the Last Iterate
Common Structure Among Tight Last Iterate Convergence Guarantees
Constant Stepsizes $h_k = h \in (0,1]$
Constant Stepsizes $h_k = h \in (1,2)$
Dynamic Stepsizes $h_k \rightarrow 2$
Silver Stepsizes $h_k$ (often much greater than two)
Suboptimal Convergence from Averaging
Discussion of Conditions (\ref{Eqn:ConditionFVal}) and (\ref{Eqn:ConditionGradNorm})
Extension to Projected/Proximal Gradient Descent
Extrapolations are Strictly Better than the Last Iterate
...and 19 more sections

Key Result

Theorem 1.1

The optimal averaging of the iterates of gradient descent is to return the final iterate when the stepsizes $h_k$ are chosen as any of [...]. That is, for the stepsize sequences above, $\sigma = (0,\dots,0,1)$ is the unique minimizer of [...] and [...].

Figures (8)

Figure 1: Values of $h$ for which \ref{Eqn:ConditionFVal} holds (left) and \ref{Eqn:ConditionGradNorm} holds (right) for $N=2,3$. Numerically, we find that this region fully aligns with the stepsizes for which averaging is not beneficial (see \ref{Conj:C1C2Converse}).
Figure 2: Optimal extrapolation choice of $\sigma$ for $N=10$ under different performance measures. Under these respective choices of $\sigma$, the performance guarantees for $L=D=1$ are $0.02000$ for objective gap and $0.08443$ for gradient norm.
Figure 3: Near-optimality of $c_\mathrm{crit}$ for objective gap and gradient norm with $N=7, L=D=h=1$.
Figure 4: The critical and optimal extrapolation factors for objective gap and gradient norm.
Figure 5: Similarity of critical and optimal factors and divergence between objective and gradient.
...and 3 more figures

Theorems & Definitions (29)

Theorem 1.1: Informal, The Optimal Averaging is to Not Average
Remark 1
Theorem 1.2: Informal, Strict Gain of $\sqrt{N/\log(N)}$ from Simple Extrapolation
Lemma 3.1
proof
Conjecture 3.2
Theorem 3.1
proof
Theorem 3.2
proof
...and 19 more
