The Nesterov-Spokoiny Acceleration Achieves Strict $o(1/k^2)$ Convergence

Weibin Peng, Yu Liu, Tianyu Wang

TL;DR

The paper introduces the Nesterov-Spokoiny Acceleration (NSA), a momentum-based scheme that preserves monotone descent while achieving fast convergence for smooth convex objectives: $o(1/k^2)$ in function value and $o(1/(k^3 \log k))$ in squared gradient norm. It extends NSA to inexact-gradient (zeroth-order) oracles and to nonsmooth composite objectives with proximal updates, retaining the $o(1/k^2)$ rate in function value, and gives descent guarantees even in nonconvex settings. A continuous-time analysis connects NSA to high-resolution ODEs, yielding a system that helps explain the acceleration phenomenon and attains an $O(1/t^2)$ rate in the convex setting. Experiments compare NSA variants with standard accelerated methods, reporting practical speedups for the zeroth-order and composite extensions. Overall, NSA offers a unified framework that combines acceleration with guaranteed descent across smooth, nonsmooth, and zeroth-order optimization problems.

Abstract

This paper studies the Nesterov-Spokoiny Acceleration (NSA), a variant of the accelerated gradient method by Nesterov and Spokoiny. For smooth convex optimization, NSA achieves a strict $o(1/k^2)$ convergence rate in function value and an $o(1/(k^3 \log k))$ rate in squared gradient norm, while ensuring monotonic descent of the objective. We further study a zeroth-order version of NSA that handles inexact gradients, and extend NSA to composite optimization problems, in each case establishing $o(1/k^2)$ convergence in function value. A continuous-time analysis reveals connections to high-resolution ODEs known to underlie acceleration phenomena.
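
For readers comparing this with the classical accelerated rate, the difference between $O(1/k^2)$ and $o(1/k^2)$ can be written out explicitly. Here $f^\star = \inf_x f(x)$ is notation introduced for this note and is assumed finite; it is not taken from the paper.

```latex
\begin{align*}
f(x_k) - f^\star = O\!\left(\frac{1}{k^2}\right)
  &\iff \limsup_{k \to \infty} k^2 \bigl( f(x_k) - f^\star \bigr) < \infty, \\
f(x_k) - f^\star = o\!\left(\frac{1}{k^2}\right)
  &\iff \lim_{k \to \infty} k^2 \bigl( f(x_k) - f^\star \bigr) = 0 .
\end{align*}
```

The little-o statement is the stronger one: the rescaled gap $k^2 (f(x_k) - f^\star)$ must vanish, not merely stay bounded.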

Paper Structure

This paper contains 16 sections, 14 theorems, 96 equations, 4 figures, 1 table, 3 algorithms.

Key Result

Theorem 1

Adopt all notation from Algorithm \ref{alg:nsa}. Let $p \ge 3$, and let $\alpha_k = \frac{p}{k+p}$ for each $k \in \mathbb{N}$. Consider an objective function $f \in \mathscr{F}_{L}^{1,1} (\mathbb{R}^n)$. If $f$ is $L$-smooth and satisfies $\inf_{x \in \mathbb{R}^n} f (x) > - \infty$ ($f$ is possibly nonconvex), then the following holds.
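
To get a feel for the schedule $\alpha_k = p/(k+p)$, here is a minimal Python sketch of a generic momentum loop with a descent safeguard. It is not the paper's algorithm, whose steps are not reproduced on this page: the function name `safeguarded_momentum`, the auxiliary sequence `z`, its step size `1 / (a * L)`, and the fallback rule are all assumptions chosen for illustration. The only ingredient taken from the statement above is the schedule itself.

```python
import numpy as np

def alpha(k, p=3):
    """Momentum weight alpha_k = p / (k + p) from the theorem statement."""
    return p / (k + p)

def safeguarded_momentum(f, grad, x0, L, iters=100, p=3):
    """Generic momentum loop with a descent safeguard (illustrative only)."""
    x = np.asarray(x0, dtype=float)
    z = x.copy()                      # hypothetical auxiliary sequence
    values = [f(x)]
    for k in range(iters):
        a = alpha(k, p)
        y = (1.0 - a) * x + a * z     # interpolate between x and z
        g = grad(y)
        candidate = y - g / L         # gradient step from the interpolated point
        fallback = x - grad(x) / L    # plain gradient step from x
        # Keep whichever point has the smaller objective. For an L-smooth f the
        # fallback never increases f, so the recorded values are non-increasing.
        x = candidate if f(candidate) <= f(fallback) else fallback
        z = z - g / (a * L)           # assumed step size for the auxiliary sequence
        values.append(f(x))
    return x, values

# Usage on a small convex quadratic f(x) = 0.5 * x^T A x, for which L = 10.
A = np.diag([1.0, 10.0])
f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x
x_final, values = safeguarded_momentum(f, grad, x0=[1.0, 1.0], L=10.0)
```

The safeguard costs one extra gradient and two extra function evaluations per iteration. That is the price this particular sketch pays for monotone descent, and it says nothing about how the paper achieves the same property.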

Figures (4)

  • Figure 1: Comparison between Algorithm 1 and the original algorithm by Nesterov and Spokoiny (2017), showing their paths in one iteration from the same starting point. The curves are contour lines. The underlined points are the new iterates of Algorithm 1; the remaining new points are from the original algorithm by Nesterov and Spokoiny (2017). The red and green dashed lines mark the function values at the next step from $x_t$ for the two methods.
  • Figure 2: Comparison of Algorithm \ref{alg:nsa-comp} (referred to as 'New NSA') with benchmark methods. In the graph legend, GD denotes Gradient Descent; NSA denotes the original acceleration method by Nesterov and Spokoiny (2017); FISTA and AFBM were introduced in [2009A] and [nesterov1983method, Nesterov2013], respectively. Here $p$ denotes the damping factor. Because the curves for NSA and AFBM are closely aligned when they use the same damping factor, we present only the NSA method with $p = 3$ and the AFBM method with $p = 4$.
  • Figure 3: Comparison of Algorithm \ref{alg:nsa-comp} (referred to as 'New NSA') with benchmark methods ($y$-axis in logarithmic scale). In the graph legend, GD denotes Gradient Descent; NSA denotes the original acceleration method by Nesterov and Spokoiny (2017); FISTA and AFBM were introduced in [2009A] and [nesterov1983method, Nesterov2013], respectively. Here $p$ denotes the damping factor. We present only the NSA method with $p = 5$ and the AFBM method with $p = 6$.
  • Figure 4: Comparison of Algorithm \ref{alg:nsa-inexact} (referred to as 'New NSA') with benchmark methods. In the graph legend, GD denotes Gradient Descent; NSA denotes the original acceleration method by Nesterov and Spokoiny (2017); FISTA and AFBM were introduced in [2009A] and [nesterov1983method, Nesterov2013], respectively. Here $p$ denotes the damping factor. Because the curves for NSA and AFBM are closely aligned when they use the same damping factor, we present only the NSA method with $p = 3$ and the AFBM method with $p = 4$. In these experiments, all methods use inexact gradients obtained from zeroth-order information.
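
An inexact gradient built from zeroth-order information can be as simple as a finite-difference estimate. The Python sketch below uses coordinate-wise forward differences. The estimator used in the paper is not specified on this page and may differ (for example, random-direction smoothing), so the function name `fd_gradient` and the step `h` are assumptions.

```python
import numpy as np

def fd_gradient(f, x, h=1e-6):
    """Forward-difference gradient estimate using only function evaluations."""
    x = np.asarray(x, dtype=float)
    fx = f(x)                         # one evaluation at the base point
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h                      # perturb a single coordinate
        g[i] = (f(x + e) - fx) / h
    return g

# Usage: compare against the exact gradient of a simple smooth function.
f = lambda x: 0.5 * np.sum(x ** 2) + x[0]
x = np.array([1.0, -2.0, 0.5])
g_est = fd_gradient(f, x)
g_true = x + np.array([1.0, 0.0, 0.0])
error = np.linalg.norm(g_est - g_true)
```

Each call uses $n + 1$ function evaluations, and for smooth $f$ the truncation error shrinks in proportion to `h` until floating-point round-off takes over.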

Theorems & Definitions (28)

  • Remark 1
  • Theorem 1
  • Proposition 1
  • proof
  • Lemma 1
  • proof
  • proof: Proof of the first two items in Theorem \ref{thm:main}
  • Proposition 2
  • proof
  • Theorem 2 [flaxman2005online, nesterov2017random, 10.1093/imaiai/iaad014]
  • ...and 18 more