Inertial Newton Algorithms Avoiding Strict Saddle Points

Camille Castera

TL;DR

This paper analyzes two second-order optimization dynamics for non-convex objectives $\mathcal{J}$: DIN, a Newton-like inertial system, and INNA, its discretized variant. Using the stable manifold theorem, it proves that for fixed parameters $\alpha>0$, $\beta>0$ these dynamics avoid strict saddle points for almost every initialization; INNA also admits convergence results under suitable step-size $\gamma$ and Lipschitz assumptions. The work further characterizes the behavior near minimizers via the Hartman–Grobman theorem, showing that trajectories may spiral around minimizers depending on $\alpha\beta$ and the spectrum of the Hessian, and illustrates this numerically. Overall, the results provide theoretical guarantees and qualitative insight into the saddle-avoidance and near-minimizer dynamics of inertial Newton-like methods in non-convex optimization, with practical implications for neural network training and related applications.
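The dependence of the spiraling on $\alpha\beta$ and the Hessian spectrum can be made concrete with a short linearization. The DIN equation is not reproduced on this page, so the form below is the one commonly given in the literature and is an assumption here; $\theta^\star$, $\lambda$, $x$ and $\mu$ are notation introduced only for this sketch, not the paper's own statement.

```latex
% Assumed form of DIN (not quoted from this page):
\begin{align*}
  \ddot{\theta}(t) + \alpha\,\dot{\theta}(t)
    + \beta\,\nabla^2\mathcal{J}(\theta(t))\,\dot{\theta}(t)
    + \nabla\mathcal{J}(\theta(t)) &= 0 .
\intertext{Linearize at a minimizer $\theta^\star$ and project onto an eigenvector of
$\nabla^2\mathcal{J}(\theta^\star)$ with eigenvalue $\lambda>0$; writing $x$ for that
component of $\theta-\theta^\star$:}
  \ddot{x} + (\alpha+\beta\lambda)\,\dot{x} + \lambda\,x &= 0,
  \qquad \mu^2 + (\alpha+\beta\lambda)\,\mu + \lambda = 0 .
\intertext{The roots $\mu$ are complex (oscillation, hence spiraling) exactly when}
  (\alpha+\beta\lambda)^2 - 4\lambda
    = \beta^2\lambda^2 + (2\alpha\beta-4)\,\lambda + \alpha^2 &< 0 .
\intertext{Seen as a quadratic in $\lambda$, the left-hand side has discriminant}
  (2\alpha\beta-4)^2 - 4\alpha^2\beta^2 &= 16\,(1-\alpha\beta),
\intertext{so it can be negative only if $\alpha\beta<1$, and then only for}
  \frac{(2-\alpha\beta) - 2\sqrt{1-\alpha\beta}}{\beta^2}
    \;<\; \lambda \;<\;
  \frac{(2-\alpha\beta) + 2\sqrt{1-\alpha\beta}}{\beta^2} .
\end{align*}
```

Under this assumed model, $\alpha\beta\ge 1$ rules out oscillation in the linearized dynamics, while for $\alpha\beta<1$ oscillation occurs only along Hessian eigen-directions whose eigenvalue lies in the interval above.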

Abstract

We study the asymptotic behavior of second-order algorithms mixing Newton's method and inertial gradient descent in non-convex landscapes. We show that, despite the Newtonian behavior of these methods, they almost always escape strict saddle points. We also evidence the role played by the hyper-parameters of these methods in their qualitative behavior near critical points. The theoretical results are supported by numerical illustrations.

Paper Structure

This paper contains 24 sections, 12 theorems, 43 equations, 4 figures.

Key Result

Theorem 3.1

Assume that $\mathcal{J}$ is a Morse function. Then, for almost any initialization, the corresponding solution of the DIN system does not converge to a point in $\mathsf{S}_{<0}$.
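The "almost any initialization" statement can be illustrated with a small experiment on the discrete counterpart. The following is a minimal sketch, not the paper's code: the INNA update rule is written as recalled from the INNA literature (an assumption, since the page does not reproduce it), and the test function $\mathcal{J}(x,y) = (x^2-1)^2/4 + y^2/2$, the helper names `grad_J` and `inna`, and all parameter values are hypothetical choices made for illustration. This $\mathcal{J}$ has a strict saddle at $(0,0)$ and minimizers at $(\pm 1, 0)$.

```python
def grad_J(x, y):
    # Gradient of the illustrative function J(x, y) = (x^2 - 1)^2 / 4 + y^2 / 2.
    return (x**3 - x, y)


def inna(theta0, alpha=0.5, beta=0.1, gamma=0.01, n_iter=20000):
    """Run an assumed INNA-style iteration with a constant step-size gamma.

    Assumed update (not taken from this page):
        drift     = (1/beta - alpha) * theta_k - psi_k / beta
        theta_k+1 = theta_k + gamma * (drift - beta * grad_J(theta_k))
        psi_k+1   = psi_k   + gamma * drift
    """
    theta = [float(c) for c in theta0]
    g = grad_J(*theta)
    # Choose psi_0 so that the first theta-increment is zero (zero initial velocity).
    psi = [(1 - alpha * beta) * t - beta**2 * gi for t, gi in zip(theta, g)]
    for _ in range(n_iter):
        g = grad_J(*theta)
        drift = [(1 / beta - alpha) * t - p / beta for t, p in zip(theta, psi)]
        theta = [t + gamma * (d - beta * gi) for t, d, gi in zip(theta, drift, g)]
        psi = [p + gamma * d for p, d in zip(psi, drift)]
    return tuple(theta)


# Generic initialization, off the axis x = 0.
off_manifold = inna((0.1, 1.0))
# Initialization on the axis x = 0, which is invariant for this particular J.
on_manifold = inna((0.0, 1.0))
print(off_manifold, on_manifold)
```

In this sketch the axis $x=0$ is the set of initializations that reach the saddle: the second run stays on it and approaches $(0,0)$, whereas the first run, started slightly off the axis, ends near a minimizer. The axis has measure zero in the plane, which is the situation the theorem describes.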

Figures (4)

  • Figure 1: Two example functions with a non-strict saddle point. In the left panel, $(0,0)$ is a minimum; in the right panel, the critical point $(0,0)$ is neither a minimum nor a maximum.
  • Figure 2: Illustration of the spiral phenomenon. Left: trajectory on the landscape of $\mathcal{J}$, with two zoomed views at the bottom left. Right: function value and distance to the minimizer against iterations.
  • Figure 3: Evolution of the iterates of INNA on the landscape of the 2D function $\mathcal{J}$ of Section \ref{sec::numexp} for two choices of $(\alpha,\beta)$. Red and blue surfaces represent the locally concave and locally convex parts of $\mathcal{J}$, respectively. The left panel shows initializations on the stable manifold of $(0,0)$, which yield convergence to $(0,0)$. The right panel shows initializations outside the manifold, which yield convergence to local minimizers.
  • Figure 4: Tables of variations for the proof of Theorem \ref{thm:INNAdiffeo}. The sign of $h''$ gives the variations and signs of $h'$ and $h$, which in turn give the minima of $\gamma^-$.

Theorems & Definitions (22)

  • Definition 1
  • Theorem 3.1
  • Corollary 3.2
  • Proof of Corollary \ref{cor::dincor}
  • Theorem 3.3: perko2013differential
  • Remark 3.4
  • Proof of Theorem \ref{thm::MainResDIN}
  • Lemma 3.5
  • Lemma 3.6
  • Theorem 3.7: Hartman–Grobman perko2013differential
  • ...and 12 more