On Averaging and Extrapolation for Gradient Descent
Alan Luner, Benjamin Grimmer
TL;DR
The paper analyzes how averaging and extrapolation of gradient-descent iterates affect worst-case convergence in smooth convex optimization, using Performance Estimation Problems to obtain exact guarantees. It proves that averaging the iterates cannot improve and often worsens performance for typical gradient-descent step patterns, while a simple extrapolation strategy with x_σ = x_0 + c(x_N − x_0) yields provable improvements that match the benefit of roughly O(√(N/log N)) extra gradient steps, with the best results obtained for c up to a critical value c_crit ≈ 1 + Θ(1/√(N log N)). The results are shown to be tight via dual certificates and Huber function constructions, and numerical experiments extend the insights to other first-order methods and real-world data, highlighting practical gains in memory-constrained settings. Overall, the work clarifies when extrapolation helps, quantifies the achievable gains, and demonstrates a cheap post-processing option to boost performance for smooth convex optimization.
Abstract
This work considers the effect of averaging, and more generally extrapolation, of the iterates of gradient descent in smooth convex optimization. After running the method, rather than reporting the final iterate, one can report either a convex combination of the iterates (averaging) or a generic combination of the iterates (extrapolation). For several common stepsize sequences, including recently developed accelerated periodically long stepsize schemes, we show averaging cannot improve gradient descent's worst-case performance and is, in fact, strictly worse than simply returning the last iterate. In contrast, we prove a conceptually simple and computationally cheap extrapolation scheme strictly improves the worst-case convergence rate: when initialized at the origin, reporting $(1+1/\sqrt{16N\log(N)})x_N$ rather than $x_N$ improves the best possible worst-case performance by the same amount as conducting $O(\sqrt{N/\log(N)})$ more gradient steps. Our analysis and characterizations of the best-possible convergence guarantees are computer-aided, using performance estimation problems. Numerically, we find similar (small) benefits from such simple extrapolation for a range of gradient methods.
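The extrapolation step described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the constant stepsize 1/L with L = 1, and the one-dimensional shifted Huber test function with threshold 1/(2N+1) are assumptions chosen for illustration. Only the factor $1+1/\sqrt{16N\log(N)}$ and the rule $x_0 + c(x_N - x_0)$ come from the text. A single instance like this shows the mechanics only and says nothing about the worst-case guarantee.

```python
import math


def extrapolation_factor(n_steps):
    """Factor c = 1 + 1/sqrt(16 N log N) quoted in the abstract (needs N >= 2)."""
    if n_steps < 2:
        raise ValueError("log(N) must be positive, so N >= 2 is required")
    return 1.0 + 1.0 / math.sqrt(16.0 * n_steps * math.log(n_steps))


def gradient_descent(grad, x0, n_steps, step):
    """Plain gradient descent with a constant stepsize; returns the last iterate."""
    x = x0
    for _ in range(n_steps):
        x = x - step * grad(x)
    return x


def extrapolate(x0, x_last, c):
    """Report x0 + c * (x_last - x0); with x0 = 0 this is simply c * x_last."""
    return x0 + c * (x_last - x0)


def make_shifted_huber(delta, x_star=1.0):
    """Hypothetical test problem: a 1-smooth Huber function minimized at x_star."""
    def f(x):
        z = abs(x - x_star)
        return 0.5 * z * z if z <= delta else delta * z - 0.5 * delta * delta

    def grad(x):
        # Gradient of the Huber function: the residual clipped to [-delta, delta].
        return max(-delta, min(delta, x - x_star))

    return f, grad


N = 10
f, grad = make_shifted_huber(delta=1.0 / (2 * N + 1))
x0 = 0.0                                   # initialized at the origin
x_last = gradient_descent(grad, x0, N, step=1.0)
x_ext = extrapolate(x0, x_last, extrapolation_factor(N))
print("f(x_N)          =", f(x_last))
print("f(extrapolated) =", f(x_ext))
```

On this particular instance every iterate stays in the linear region of the Huber function, so scaling $x_N$ slightly away from the origin moves it closer to the minimizer and lowers the objective. The extrapolation costs one scalar multiplication and needs no extra gradient evaluations or stored iterates.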
