Nesterov acceleration in benignly non-convex landscapes

Kanan Gupta, Stephan Wojtowytsch

TL;DR

The paper addresses the gap between theory and practice for momentum-based optimization in non-convex settings by introducing a moving-closest-minimizer geometry via the projection $\pi(x)$ and a $\mu$-strong aiming condition. It proves accelerated convergence for continuous-time heavy-ball dynamics and discrete-time Nesterov schemes, including stochastic variants with additive and multiplicative noise, under weaker geometric assumptions than global convexity. The main contributions are a continuous-time convergence bound with a Lyapunov energy, and discrete-time rates showing acceleration (compared to gradient methods) while accounting for tangential motion along a minimizer manifold. The results align with deep learning landscapes by capturing local convexity toward the minimizer manifold and demonstrate that acceleration can persist locally in benign non-convex landscapes, with implications for algorithm design in overparameterized models.

Abstract

While momentum-based optimization algorithms are commonly used in the notoriously non-convex optimization problems of deep learning, their analysis has historically been restricted to the convex and strongly convex settings. In this article, we partially close this gap between theory and practice and demonstrate that virtually identical guarantees can be obtained in optimization problems with a 'benign' non-convexity. We show that these weaker geometric assumptions are well justified in overparametrized deep learning, at least locally. Variations of this result are obtained for a continuous time model of Nesterov's accelerated gradient descent algorithm (NAG), the classical discrete time version of NAG, and versions of NAG with stochastic gradient estimates, both with purely additive noise and with noise that exhibits both additive and multiplicative scaling.
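
For orientation, the textbook constant-momentum form of NAG for an $L$-smooth, $\mu$-strongly convex objective can be written as follows. This is a sketch of the standard scheme, not necessarily the exact parametrization analyzed in the paper; the symbols $\eta$ (step size), $\rho$ (momentum) and $L$ (gradient Lipschitz constant) are notation assumed here and do not appear in the text above.

$$x_{n+1} = y_n - \eta\,\nabla f(y_n), \qquad y_{n+1} = x_{n+1} + \rho\,(x_{n+1} - x_n), \qquad \eta = \frac{1}{L}, \quad \rho = \frac{\sqrt{L} - \sqrt{\mu}}{\sqrt{L} + \sqrt{\mu}}.$$

A commonly used continuous-time counterpart in the strongly convex setting is the heavy-ball ODE $\ddot{x} + 2\sqrt{\mu}\,\dot{x} + \nabla f(x) = 0$, which corresponds to the time scaling $t \approx n\sqrt{\eta}$.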

Paper Structure

This paper contains 22 sections, 24 theorems, 166 equations, 4 figures.

Key Result

Lemma 4

[wojtowytsch2023stochastic] Let $f:\mathbb R^d\to\mathbb R$ be a $C^2$-function and $\mathcal{M} = \{x\in\mathbb R^d : f(x) = \inf f\}$. Assume that $\mathcal{M}$ is a $k$-dimensional $C^1$-submanifold of $\mathbb R^d$. Let $z\in \mathcal{M}$, let $T_z\mathcal{M}$ be the tangent space at $z$, and let $r>0$. If $\mathcal

Figures (4)

  • Figure 1: We visualize $f$ from Example \ref{example 1d} in the top row and its derivative in the bottom row with $R=2$ and $=0.2$ (left), $=0.1$ (middle) and $= 0.05$ (right). Left: $f$ has many local minimizers as the derivative crosses $0$ an infinite number of times. Middle: $f$ satisfies the PL condition, but not the strong aiming condition. Right: $f$ is strongly aiming (with respect to the unique global minimizer, which implies the PL condition). In all plots, $f$ is non-convex since $f'$ is non-monotone.
  • Figure 2: Left: The dashed red line connects two minimizers of the function $f$. Along the line, $f$ must achieve an interior local maximum. At this point, the Hessian $D^2f$ cannot be positive definite. Middle, Right: Optimization trajectories for Nesterov's method (top) and the associated energy curves (bottom). The selection of the limit point may depend crucially on the optimization parameters: in the middle plot, we take 800 steps with stepsize $10^{-2}$, while on the right, we take 8,000 steps with stepsize $10^{-3}$ from the same initial point. The decay of $f(x_t)$ is similar for both trajectories, but the limit points on the manifold of minimizers are far apart. The objective function is $f(x,y) = (x^2/2 + 3y^2-1)^2$.
  • Figure 4: We compare the trajectories of gradient descent and Nesterov's algorithm for the objective function $f$ in Example \ref{example 1d returns} with $R=6$ and $= 0.075$ (left), $= 0.08$ (middle) and $= 0.085$ (right). Evidently, if $\sqrt{1+4R^2}$ is very close to the threshold value 1, gradient descent outperforms Nesterov's algorithm with the theoretically guaranteed parameters.
  • Figure : Convexity analysis of $\phi(t) = L(w+tg)$ for $w$ near global minimizers of a loss function $L$ and $g = \nabla L(w) / \|\nabla L(w)\|$. Left: $\phi(t)$, middle: second derivative of $\phi$, right: estimated strong aiming parameter $\mu$ for $t \in [-1,1]$. Evidently, $\phi$ is strongly convex in a neighborhood of the minimizers. The strong aiming condition yields consistently larger constants than the strong convexity parameter obtained from second derivatives. Different colors correspond to different random initializations.
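
The two-run comparison described in the Figure 2 caption can be sketched roughly as below. The objective, the step counts and the step sizes are taken from the caption; the initial point `P0`, the curvature guess `mu` and the resulting momentum formula are assumptions made for illustration only, since the caption does not state them, so the limit points produced here need not match those in the figure.

```python
import math


def f(p):
    """Objective from the Figure 2 caption: f(x, y) = (x^2/2 + 3y^2 - 1)^2."""
    x, y = p
    return (0.5 * x * x + 3.0 * y * y - 1.0) ** 2


def grad_f(p):
    """Gradient of f: with u = x^2/2 + 3y^2 - 1, grad f = (2*u*x, 12*u*y)."""
    x, y = p
    u = 0.5 * x * x + 3.0 * y * y - 1.0
    return (2.0 * u * x, 12.0 * u * y)


def nesterov(p0, eta, steps, mu=1.0):
    """Textbook constant-momentum Nesterov iteration.

    mu is a hypothetical curvature parameter (not given in the caption);
    the momentum rho = (1 - sqrt(mu*eta)) / (1 + sqrt(mu*eta)) is the
    standard strongly convex choice and is assumed here.
    """
    s = math.sqrt(mu * eta)
    rho = (1.0 - s) / (1.0 + s)
    x = tuple(p0)
    y = tuple(p0)
    for _ in range(steps):
        g = grad_f(y)
        # Gradient step from the extrapolated point y.
        x_new = tuple(yi - eta * gi for yi, gi in zip(y, g))
        # Momentum extrapolation using the last two iterates.
        y = tuple(xn + rho * (xn - xo) for xn, xo in zip(x_new, x))
        x = x_new
    return x


P0 = (1.5, 0.5)  # hypothetical initial point, not taken from the paper

coarse = nesterov(P0, eta=1e-2, steps=800)   # 800 steps, stepsize 1e-2
fine = nesterov(P0, eta=1e-3, steps=8000)    # 8,000 steps, stepsize 1e-3

if __name__ == "__main__":
    print("coarse limit:", coarse, "f =", f(coarse))
    print("fine limit:  ", fine, "f =", f(fine))
```

Comparing `coarse` and `fine` shows where each run lands on the ellipse $x^2/2 + 3y^2 = 1$ of minimizers. Since gradients vanish on that set, the tangential position of the limit point is determined by the transient dynamics, which is why it can depend on the step size and momentum.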

Theorems & Definitions (46)

  • Example 1
  • Example 2
  • Example 3
  • Lemma 4
  • Lemma 4
  • Theorem 5
  • Lemma 5
  • Theorem 6
  • Remark 7
  • Lemma 7
  • ...and 36 more