Convergence, Sticking and Escape: Stochastic Dynamics Near Critical Points in SGD

Dmitry Dudukalov, Artem Logachov, Vladimir Lotov, Timofei Prasolov, Evgeny Prokopenko, Anton Tarasenko

TL;DR

This paper analyzes stochastic gradient descent in a one-dimensional landscape under additive noise with either infinite or finite variance. It establishes precise time-scaling regimes under which SGD converges to the local basin minimum containing the initial point, and it characterizes lingering near high-order critical points and escape from sharp maxima via random-walk arguments. The results reveal metastable dynamics: on longer time scales the process behaves like a Markov chain over local minima, with escape probabilities from sharp maxima computable from drift parameters. The work provides a rigorous probabilistic framework for understanding SGD transitions in non-convex settings, highlighting the roles of noise tails and local geometry and offering insights applicable to initialization and step-size choices.

Abstract

We study the convergence properties and escape dynamics of Stochastic Gradient Descent (SGD) in one-dimensional landscapes, separately considering infinite- and finite-variance noise. Our main focus is to identify the time scales on which SGD reliably moves from an initial point to the local minimum in the same "basin". Under suitable conditions on the noise distribution, we prove that SGD converges to the basin's minimum unless the initial point lies too close to a local maximum. In that near-maximum scenario, we show that SGD can linger for a long time in its neighborhood. For initial points near a "sharp" maximum, we show that SGD does not remain stuck there, and we provide results to estimate the probability that it will reach each of the two neighboring minima. Overall, our findings present a nuanced view of SGD's transitions between local maxima and minima, influenced by both noise characteristics and the underlying function geometry.
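
As a rough illustration of the basin-convergence setting described in the abstract, the following Python sketch simulates one-dimensional SGD in a double-well potential with symmetric heavy-tailed noise. Everything specific here is an assumption made for illustration and is not taken from the paper: the potential $f(x) = (x^2-1)^2/4$, the update rule $x_{k+1} = x_k - \varepsilon\,(f'(x_k) + \xi_k)$, the unit-scale Pareto noise, the clipping safeguard, and the names `grad_f`, `symmetric_pareto`, and `run_sgd`.

```python
import random


def grad_f(x):
    # Assumed potential f(x) = (x^2 - 1)^2 / 4: minima at x = -1 and x = +1,
    # local maximum at x = 0. Its derivative is x * (x^2 - 1).
    return x * (x * x - 1.0)


def symmetric_pareto(rng, alpha):
    # Heavy-tailed noise: |xi| is Pareto with tail index alpha and minimum 1
    # (the unit scale is an assumption); the sign is chosen uniformly.
    magnitude = rng.paretovariate(alpha)
    return magnitude if rng.random() < 0.5 else -magnitude


def run_sgd(x0, eps, steps, alpha, noise_scale=1.0, seed=0, clip=3.0):
    # Assumed update: x <- x - eps * (f'(x) + noise_scale * xi).
    # Clipping to [-clip, clip] is a numerical safeguard for this sketch only;
    # it keeps a rare huge jump from blowing up the cubic gradient.
    rng = random.Random(seed)
    x = x0
    for _ in range(steps):
        xi = symmetric_pareto(rng, alpha)
        x = x - eps * (grad_f(x) + noise_scale * xi)
        x = max(-clip, min(clip, x))
    return x


if __name__ == "__main__":
    # Noise-free reference run versus a run with infinite-variance noise.
    print(run_sgd(0.5, 0.01, 5000, alpha=1.2, noise_scale=0.0))
    print(run_sgd(0.5, 0.01, 5000, alpha=1.2))
```

With `noise_scale=0.0` the iteration reduces to plain gradient descent and settles at the minimum of the basin containing `x0`. With the noise switched on, occasional large jumps can carry the iterate into the other basin, so the outcome depends on the seed, the step size, and the number of steps.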

Paper Structure

This paper contains 16 sections, 18 theorems, 160 equations, 2 figures, 3 tables.

Key Result

Theorem 2.1

Suppose condition H1 is satisfied, and for some $1\leqslant r\leqslant d$ and $\Delta >0$, we have $x_0^{\varepsilon} \in (M_{r-1} + \Delta,\, M_{r} - \Delta)$. Then

Figures (2)

  • Figure 1: Illustration of the behavior of the studied object. The Himmelblau function is considered as the objective function to be optimized. The noise $\xi_k$ is generated from an isotropic distribution, with the norm $\|\xi_k\|$ following a Pareto distribution with $\alpha=1.2$. We initialize the SGD at the point $x_0^{\varepsilon} =(-0.270845,-0.923039)$, which is located in close proximity to a local maximum. The number of steps in the trajectory is the same across all four plots and equals $10^5$. The step size varies from left to right as follows: $\varepsilon=10^{-3},10^{-4},10^{-5},10^{-6}.$
  • Figure 2: SGD trajectories on a double-well cubic-spline potential under different noise types.
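
The setup in the Figure 1 caption can be sketched in Python as follows. The Himmelblau function, the initial point, and $\alpha = 1.2$ come from the caption. The rest is assumed for illustration: the unit Pareto scale, the update rule $x_{k+1} = x_k - \varepsilon\,(\nabla f(x_k) + \xi_k)$, and the names `himmelblau_grad`, `isotropic_pareto_noise`, and `sgd_step`.

```python
import math
import random


def himmelblau_grad(x, y):
    # Himmelblau function: f(x, y) = (x^2 + y - 11)^2 + (x + y^2 - 7)^2.
    a = x * x + y - 11.0
    b = x + y * y - 7.0
    return 4.0 * x * a + 2.0 * b, 2.0 * a + 4.0 * y * b


def isotropic_pareto_noise(rng, alpha):
    # Direction uniform on the circle; norm Pareto with tail index alpha.
    # The minimum norm of 1 (unit scale) is an assumption of this sketch.
    radius = rng.paretovariate(alpha)
    angle = rng.uniform(0.0, math.tau)
    return radius * math.cos(angle), radius * math.sin(angle)


def sgd_step(point, eps, rng, alpha=1.2):
    # Assumed update: x <- x - eps * (grad f(x) + xi).
    x, y = point
    gx, gy = himmelblau_grad(x, y)
    nx, ny = isotropic_pareto_noise(rng, alpha)
    return x - eps * (gx + nx), y - eps * (gy + ny)


# Initial point quoted in the caption, close to the local maximum.
X0 = (-0.270845, -0.923039)

if __name__ == "__main__":
    rng = random.Random(0)
    point = X0
    for _ in range(1000):
        point = sgd_step(point, 1e-4, rng)
    print(point)
```

The gradient at `X0` is nearly zero, so the first steps are driven almost entirely by the noise term, which is the regime the figure illustrates.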

Theorems & Definitions (41)

  • Theorem 2.1
  • Theorem 2.2
  • Remark 2.3
  • Remark 2.4
  • Theorem 2.5
  • Theorem 2.6
  • Remark 2.7
  • Example 2.8
  • Remark 2.9
  • Theorem 2.10
  • ...and 31 more