Continuized Nesterov Momentum Achieves the $O(\varepsilon^{-7/4})$ Complexity without Additional Mechanisms

Julien Hermant, Jean-François Aujol, Charles Dossal, Lorick Huang, Aude Rondepierre

TL;DR

The paper proves that a continuized Nesterov momentum method, with stochastic but function-independent parameters and no safeguard mechanisms, attains the same $O(\varepsilon^{-7/4})$ complexity for finding an $\varepsilon$-stationary point as prior safeguarded methods. By blending continuous momentum dynamics with random gradient updates (via a Poisson process) and analyzing a Poisson-averaged trajectory, the authors derive convergence in expectation under Lipschitz gradient and Hessian assumptions. Under these conditions, they establish a rate of $\mathcal{O}(n^{-4/7})$ for the gradient norm in a suitably weighted sense, which implies the target complexity bound when expressed in terms of gradient evaluations. The results hinge on a careful transfer from a continuous-time analysis to a discrete algorithm, and they reveal that safeguards may not be fundamentally necessary for accelerated first-order non-convex optimization in this setting. Limitations include dependence on a random event $\mathcal{A}_n$ and a normalization term $\Delta_n/\mathbb{E}[\Delta_n]$, though empirical evidence suggests these are mild and likely removable in future work.
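The mechanism described above (continuous momentum mixing between gradient steps that fire at the jump times of a Poisson process) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: the function names `mix` and `continuized_momentum`, the constant parameters `eta`, `eta_prime`, `gamma`, `gamma_prime`, and the unit Poisson rate are all hypothetical placeholders, and the paper's actual parameter choices and its Poisson-averaged output point are not reproduced here.

```python
import numpy as np


def mix(x, z, eta, eta_prime, tau):
    """Exact flow of dx = eta*(z - x) dt, dz = eta_prime*(x - z) dt over a time tau.

    The combination eta_prime*x + eta*z is conserved, while the gap x - z
    shrinks by the factor exp(-(eta + eta_prime) * tau).
    """
    s = eta + eta_prime
    if s == 0.0:
        return x.copy(), z.copy()
    w = (eta_prime * x + eta * z) / s      # conserved weighted average
    d = (x - z) * np.exp(-s * tau)         # decayed gap between the two sequences
    return w + (eta / s) * d, w - (eta_prime / s) * d


def continuized_momentum(grad, x0, n_iters, eta, eta_prime, gamma, gamma_prime, rng):
    """Sketch of a continuized momentum loop with constant (assumed) parameters.

    Waiting times between gradient steps are i.i.d. Exp(1), i.e. the jump
    times of a rate-1 Poisson process; they do not depend on the objective.
    """
    x = np.array(x0, dtype=float)
    z = x.copy()
    for _ in range(n_iters):
        tau = rng.exponential(1.0)             # random time until the next jump
        x, z = mix(x, z, eta, eta_prime, tau)  # continuous mixing between jumps
        g = grad(x)                            # one gradient evaluation per jump
        x = x - gamma * g                      # gradient step on x
        z = z - gamma_prime * g                # gradient step on z, same gradient
    return x, z


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy objective f(v) = 0.5 * ||v||^2, whose gradient is v itself.
    x_final, z_final = continuized_momentum(
        lambda v: v, [1.0, 2.0], 50, 1.0, 1.0, 0.1, 0.1, rng
    )
    print(np.linalg.norm(x_final))
```

The only randomness enters through the waiting times `tau`, which is what "stochastic but function-independent parameters" refers to in this sketch: the same sequence of times could be drawn before the objective is known.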

Abstract

For first-order optimization of non-convex functions with Lipschitz continuous gradient and Hessian, the best known complexity for reaching an $\varepsilon$-approximation of a stationary point is $O(\varepsilon^{-7/4})$. Existing algorithms achieving this bound are based on momentum, but are always complemented with safeguard mechanisms, such as restarts or negative-curvature exploitation steps. Whether such mechanisms are fundamentally necessary has remained an open question. Leveraging the continuized method, we show that a Nesterov momentum algorithm with stochastic parameters alone achieves the same complexity in expectation. This result holds up to a multiplicative stochastic factor with unit expectation and a restriction to a subset of the realizations, both of which are independent of the objective function. We empirically verify that these constitute mild limitations.
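The correspondence between a rate in the iteration count $n$ and a complexity in $\varepsilon$ is a one-line inversion. As a sketch, assume a bound of the form $\mathbb{E}\left[\|\nabla f(\hat{x}_n)\|\right] \le C\, n^{-4/7}$, where the constant $C > 0$ and the output point $\hat{x}_n$ are hypothetical names introduced here (the paper's bound holds in a weighted sense and on a subset of the realizations). Then

$$
C\, n^{-4/7} \le \varepsilon
\quad\Longleftrightarrow\quad
n \ge \left(\frac{C}{\varepsilon}\right)^{7/4} = C^{7/4}\, \varepsilon^{-7/4}.
$$

If each iteration costs a fixed number of gradient evaluations (also an assumption of this sketch), reaching an $\varepsilon$-stationary point in expectation therefore takes $O(\varepsilon^{-7/4})$ gradient evaluations.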

Paper Structure (36 sections, 27 theorems, 168 equations, 3 figures, 3 algorithms)

Key Result

Theorem 1.1

Let $\mathcal{A}_n \subset \Omega$ be a subset of the realizations and $\chi_n > 0$ a random variable satisfying $\mathbb{E}\left[\chi_n \right] = 1$, for some $n \in {\mathbb{N}}^\ast$. For $f$ with Lipschitz gradient and Hessian, after $n$ iterations, alg:intro, properly parameterized, outputs a point $\

Figures (3)

  • Figure 1: For $100$ realizations of sequences $\{ T_k \}_{k\in \{1,\dots,10000 \}}$, (a) shows the evolution of the maximum over all realizations of $H_0^i -5\mathbb{E}\left[H_0^i \right]$; similarly, (b) and (c) show it for $H_1^i -5\mathbb{E}\left[H_1^i \right]$ and $H_2^i -5\mathbb{E}\left[H_2^i \right]$ (see Definition \ref{ass:gamma_markov}). All the realizations belong to $\mathcal{A}_n$, and this is even more pronounced for larger $n$. For $100$ realizations of sequences $\{ T_k \}_{k\in \{1,\dots,1000 \}}$, (d) shows the evolution of $\Delta_n/\mathbb{E}\left[\Delta_n \right]$ (defined in Theorem \ref{thm:hess_lip}). The min, max, and mean are taken at each iteration over all the realizations. Observe that the ratio concentrates around $1$.
  • Figure 2: Histograms of the centered laws of $H_j^i$ for $j =0,1,2$ and $i = 2,10,100$ (see Definition \ref{ass:gamma_markov}). The underlying $n$ is $100$.
  • Figure 3: Decrease of $\log f$ values along the iterations of gradient descent and Algorithm \ref{alg:constant_param_det}, with $f$ defined by the matrix factorization problem \ref{prob:matrix_factor}. Because Algorithm \ref{alg:constant_param_det} is a stochastic algorithm, the function values are averaged iteration-wise over $10$ runs. Algorithm \ref{alg:constant_param_det} decreases $\log f$ faster than gradient descent.

Theorems & Definitions (40)

  • Theorem 1.1: Informal version
  • Proposition 2.3: hermant2025continuized, Proposition 5
  • Lemma 3.1
  • Proposition 3.2
  • Definition 4.1
  • Theorem 4.2
  • Lemma 5.1
  • Lemma 5.2
  • Lemma 5.3
  • Lemma 5.4
  • ...and 30 more