Convergence of optimizers implies eigenvalues filtering at equilibrium
Jerome Bolte, Quoc-Tung Le, Edouard Pauwels
TL;DR
The work reframes optimizer convergence as eigenvalue filtering governed by the hyperparameters. For a general dynamical system x_{k+1} = G_α(x_k) = D x_k - α g(x_k), it proves that, under mild assumptions, runs converging to a point x̄ satisfy ρ(Jac G_α(x̄)) ≤ 1, where ρ(·) denotes the spectral radius, and it links this to the edge-of-stability (EOS) phenomenon. It develops a Hadamard–Perron-style stable manifold theorem and extends it to definable (semialgebraic/o-minimal) settings, avoiding strong non-degeneracy or global Lipschitz requirements. The theory is instantiated for gradient descent, Polyak's heavy ball (HB), Nesterov's method, and USAM, as well as for two stronger variants, Two-step USAM and Hessian USAM. It yields explicit bounds on the Hessian eigenvalues λ at the limit point, such as 0 ≤ λ ≤ 2/α, 0 ≤ λ ≤ 2(1+β)/α, and 0 ≤ λ(1+ρλ) ≤ 2(1+β)/α, among others; in these bounds α is the step size, β the momentum parameter, and ρ the SAM perturbation parameter (not the spectral radius). Numerical experiments on MNIST, Fashion-MNIST, and CIFAR-10 corroborate that USAM variants yield flatter minima and tighter eigenvalue filtering, offering explanations for EOS and guidance for designing optimizers that are more robust to sharp minima. The results bridge dynamical-systems theory and deep-learning optimization, providing a principled route to wider basins and better generalization through controlled curvature at convergence.
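To make the filtering condition concrete, here is a minimal Python sketch that checks ρ(Jac G_α(x̄)) ≤ 1 on a one-dimensional quadratic f(x) = λx²/2. It is an illustration under stated assumptions, not the paper's code: the function names are hypothetical, and the heavy-ball iteration is assumed to be written on the state (x_k, x_{k-1}) with D = [[1+β, -β], [1, 0]], so that the Jacobian at the minimizer is [[1+β-αλ, -β], [1, 0]].

```python
import numpy as np

def gd_jacobian(lam, alpha):
    # Gradient descent on f(x) = lam * x**2 / 2: G(x) = x - alpha * lam * x.
    return np.array([[1.0 - alpha * lam]])

def hb_jacobian(lam, alpha, beta):
    # Heavy ball on the state (x_k, x_{k-1}) (assumed formulation):
    # x_{k+1} = x_k + beta * (x_k - x_{k-1}) - alpha * lam * x_k.
    return np.array([[1.0 + beta - alpha * lam, -beta],
                     [1.0, 0.0]])

def spectral_radius(J):
    # Largest modulus among the eigenvalues of J.
    return float(np.max(np.abs(np.linalg.eigvals(J))))

alpha, beta = 0.1, 0.5
gd_threshold = 2.0 / alpha                  # bound 0 <= lam <= 2/alpha
hb_threshold = 2.0 * (1.0 + beta) / alpha   # bound 0 <= lam <= 2(1+beta)/alpha

for lam in (10.0, 25.0, 35.0):
    rho_gd = spectral_radius(gd_jacobian(lam, alpha))
    rho_hb = spectral_radius(hb_jacobian(lam, alpha, beta))
    print(f"lam={lam:5.1f}  GD stable: {rho_gd <= 1.0}  HB stable: {rho_hb <= 1.0}")
```

With these illustrative values the thresholds are 20 for gradient descent and 30 for heavy ball, so a curvature of 25 is rejected by gradient descent but still admitted by heavy ball, while 35 is rejected by both.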
Abstract
Ample empirical evidence in deep neural network training suggests that a variety of optimizers tend to find nearly global optima. In this article, we adopt the reversed perspective that convergence to an arbitrary point is assumed rather than proven, focusing on the consequences of this assumption. From this viewpoint, in line with recent advances on the edge-of-stability phenomenon, we argue that different optimizers effectively act as eigenvalue filters determined by their hyperparameters. Specifically, the standard gradient descent method inherently avoids the sharpest minima, whereas Sharpness-Aware Minimization (SAM) algorithms go even further by actively favoring wider basins. Inspired by these insights, we propose two novel algorithms that exhibit enhanced eigenvalue filtering, effectively promoting wider minima. Our theoretical analysis leverages a generalized Hadamard--Perron stable manifold theorem and applies to general semialgebraic $C^2$ functions, without requiring additional non-degeneracy conditions or global Lipschitz bound assumptions. We support our conclusions with numerical experiments on feed-forward neural networks.
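The claim that SAM-type methods filter curvature more aggressively than gradient descent can be illustrated on the same kind of toy problem. The sketch below is a hedged example, not the authors' experiment: it assumes the unnormalized SAM (USAM) update x_{k+1} = x_k - α ∇f(x_k + ρ ∇f(x_k)) with perturbation parameter ρ, applied to f(x) = λx²/2, and all function names and numerical values are chosen for illustration only. On this quadratic the USAM step multiplies x by 1 - αλ(1+ρλ), so it is stable only when λ(1+ρλ) ≤ 2/α, which is stricter than the gradient descent condition λ ≤ 2/α.

```python
def make_gd(lam, alpha):
    # One gradient descent step on f(x) = lam * x**2 / 2.
    return lambda x: x - alpha * lam * x

def make_usam(lam, alpha, rho):
    # One USAM step (assumed form): gradient evaluated at x + rho * grad f(x).
    def step(x):
        perturbed = x + rho * lam * x
        return x - alpha * lam * perturbed
    return step

def run(step, x0=1.0, iters=200):
    # Iterate the map and return the final distance to the minimizer at 0.
    x = x0
    for _ in range(iters):
        x = step(x)
    return abs(x)

alpha, rho = 0.1, 0.05
for lam in (10.0, 15.0):
    gd_dist = run(make_gd(lam, alpha))
    usam_dist = run(make_usam(lam, alpha, rho))
    print(f"lam={lam:4.1f}  GD converges: {gd_dist < 1e-6}  "
          f"USAM converges: {usam_dist < 1e-6}")
```

With α = 0.1 and ρ = 0.05, a minimum of curvature 15 is within the gradient descent range (15 ≤ 20) but outside the USAM range (15 · 1.75 = 26.25 > 20), so gradient descent settles there while the USAM iterates are pushed away.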
