Convergence of optimizers implies eigenvalues filtering at equilibrium

Jerome Bolte, Quoc-Tung Le, Edouard Pauwels

TL;DR

The work reframes optimizer convergence as eigenvalue filtering governed by hyperparameters, within a general dynamical system x_{k+1}=G_α(x_k)=D x_k-α g(x_k). It proves that successful runs satisfy ρ(Jac G_α(x̄)) ≤ 1 under mild assumptions, where ρ(·) denotes the spectral radius (not to be confused with the SAM perturbation radius ρ appearing in the bounds below), and links this to the edge-of-stability (EOS) phenomenon. It develops a Hadamard–Perron–style stable manifold theorem and extends it to definable (semi-algebraic/o-minimal) settings, avoiding strong non-degeneracy or global Lipschitz requirements. The theory is instantiated for gradient descent, Polyak’s Heavy Ball, Nesterov, and USAM, as well as two stronger variants (Two-step USAM and Hessian USAM), producing explicit curvature bounds such as 0 ≤ λ ≤ 2/α, 0 ≤ λ ≤ 2(1+β)/α, and 0 ≤ λ(1+ρλ) ≤ 2(1+β)/α. Numerical experiments on MNIST, Fashion-MNIST, and CIFAR-10 corroborate that USAM variants yield flatter minima and tighter eigenvalue filtering, offering explanations for EOS and guidance for designing optimizers that are more robust to sharp minima. The results bridge dynamical-systems theory and deep-learning optimization, providing a principled route to wider basins and better generalization through controlled curvature at convergence.
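As an illustration of the bound 0 ≤ λ ≤ 2(1+β)/α quoted above, the following minimal sketch checks it numerically on a one-dimensional quadratic. It assumes the standard heavy-ball update x_{k+1} = x_k - α f'(x_k) + β(x_k - x_{k-1}) applied to f(x) = λx²/2, written in the state (x_k, x_{k-1}); the function names `heavy_ball_spectral_radius` and `passes_filter` are hypothetical helpers introduced here, not taken from the paper.

```python
import cmath


def heavy_ball_spectral_radius(lam, alpha, beta):
    """Spectral radius of the heavy-ball Jacobian on f(x) = lam * x**2 / 2.

    Assumed update: x_{k+1} = x_k - alpha*lam*x_k + beta*(x_k - x_{k-1}).
    In the state (x_k, x_{k-1}) the Jacobian is
        [[1 + beta - alpha*lam, -beta],
         [1,                     0   ]],
    whose eigenvalues are the roots of z**2 - trace*z + det.
    """
    trace = 1.0 + beta - alpha * lam
    det = beta
    disc = cmath.sqrt(trace * trace - 4.0 * det)  # may be complex
    roots = ((trace + disc) / 2.0, (trace - disc) / 2.0)
    return max(abs(r) for r in roots)


def passes_filter(lam, alpha, beta):
    """Curvature bound from the summary: 0 <= lam <= 2(1 + beta)/alpha."""
    return 0.0 <= lam <= 2.0 * (1.0 + beta) / alpha


alpha, beta = 0.1, 0.9  # threshold 2(1 + beta)/alpha is roughly 38
for lam in (1.0, 37.0, 39.0, -1.0):
    radius = heavy_ball_spectral_radius(lam, alpha, beta)
    print(f"lam={lam:6.1f}  radius={radius:.4f}  "
          f"nonexpansive={radius <= 1.0}  in_filter={passes_filter(lam, alpha, beta)}")
```

Setting `beta = 0` reduces the nonzero eigenvalue to 1 - αλ, which recovers the gradient descent bound 0 ≤ λ ≤ 2/α.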

Abstract

Ample empirical evidence in deep neural network training suggests that a variety of optimizers tend to find nearly global optima. In this article, we adopt the reversed perspective that convergence to an arbitrary point is assumed rather than proven, focusing on the consequences of this assumption. From this viewpoint, in line with recent advances on the edge-of-stability phenomenon, we argue that different optimizers effectively act as eigenvalue filters determined by their hyperparameters. Specifically, the standard gradient descent method inherently avoids the sharpest minima, whereas Sharpness-Aware Minimization (SAM) algorithms go even further by actively favoring wider basins. Inspired by these insights, we propose two novel algorithms that exhibit enhanced eigenvalue filtering, effectively promoting wider minima. Our theoretical analysis leverages a generalized Hadamard--Perron stable manifold theorem and applies to general semialgebraic $C^2$ functions, without requiring additional non-degeneracy conditions or global Lipschitz bound assumptions. We support our conclusions with numerical experiments on feed-forward neural networks.
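The claim that SAM-type methods filter curvature more aggressively than gradient descent can be seen on a toy problem. The sketch below assumes the usual unnormalized SAM (USAM) update x_{k+1} = x_k - α ∇f(x_k + ρ ∇f(x_k)), which is not spelled out in this summary, and runs it next to gradient descent on f(x) = λx²/2. On this quadratic the USAM step multiplies x by 1 - αλ(1+ρλ), so a curvature λ can be accepted by gradient descent yet rejected by USAM. The helpers `run`, `make_gd` and `make_usam` are hypothetical names used only for this illustration.

```python
def run(step, x0=1.0, iters=200, blowup=1e6):
    """Iterate x <- step(x); return |x| at the end, or inf if it blows up."""
    x = x0
    for _ in range(iters):
        x = step(x)
        if abs(x) > blowup:
            return float("inf")
    return abs(x)


def make_gd(lam, alpha):
    """Gradient descent step on f(x) = lam * x**2 / 2."""
    grad = lambda x: lam * x
    return lambda x: x - alpha * grad(x)


def make_usam(lam, alpha, rho):
    """Assumed USAM step: gradient evaluated at the ascent point x + rho*grad(x)."""
    grad = lambda x: lam * x
    return lambda x: x - alpha * grad(x + rho * grad(x))


alpha, rho = 0.1, 0.05
for lam in (5.0, 15.0):
    gd_final = run(make_gd(lam, alpha))
    usam_final = run(make_usam(lam, alpha, rho))
    print(f"lam={lam:5.1f}  GD |x|={gd_final:.3e}  USAM |x|={usam_final:.3e}")
```

With these values, λ = 15 satisfies the gradient descent bound λ ≤ 2/α = 20 but violates the USAM bound since λ(1+ρλ) = 26.25 > 20, so only the USAM iterates diverge; λ = 5 passes both filters.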

Paper Structure

This paper contains 19 sections, 13 theorems, 56 equations, 3 figures.

Key Result

Theorem 1.1

Let $D \in \mathbb{R}^{m \times m}$ be an invertible matrix, $g: \mathbb{R}^m \to \mathbb{R}^m$ be a $C^1$ semi-algebraic mapping. For almost all $x_0 \in \mathbb{R}^m$ and $\alpha>0$ the following assertion holds true: if the sequence $(x_k)_{k\in \mathbb{N}}$ converges to some point $\bar{x}$, the

Figures (3)

  • Figure 1: (Experiment 1) - MLP trained on MNIST with stochastic gradient descent and its corresponding SAM, USAM, USAM2 and Hessian USAM versions. Left: without momentum; right: with $\beta = 0.9$. SAM, USAM and USAM2 are trained with $\rho \in \{0.05, 0.1, 0.2\}$, while Hessian USAM is trained with $\rho \in \{0.01, 0.02, 0.05, 0.1, 0.2\}$. Among these values of $\rho$, we select those yielding the best models (in terms of test accuracy) and report their accuracy and Hessian spectra.
  • Figure 2: (Experiment 2) - Same as Figure 1 with the Fashion-MNIST dataset.
  • Figure 3: (Experiment 3) - Models are trained with stochastic gradient descent and its corresponding SAM, USAM, USAM2 and Hessian USAM versions. The value of $\rho$ for all SAM-like algorithms is set to $\rho = 0.001$.

Theorems & Definitions (31)

  • Theorem 1.1: Successful runs imply nonexpansiveness at equilibrium
  • Theorem 2.1
  • Example 2.2
  • Theorem 2.3: Refined version of stable center manifold theorem
  • Definition 2.4: Semi-algebraic sets and functions
  • Example 2.5: Semi-algebraic functions
  • Lemma 2.6
  • Remark 2.7: Beyond semi-algebraicity
  • Proposition 3.1: Gradient descent eigenvalues filtering
  • Proposition 3.2: Heavy Ball eigenvalues filtering
  • ...and 21 more