Convergence of Markov Chains for Constant Step-size Stochastic Gradient Descent with Separable Functions

David Shirokoff; Philip Zaleski

TL;DR

This work analyzes constant step-size SGD for separable non-convex objectives by modeling the iterates as a Markov chain on a general state space. It proves a Doeblin-type decomposition that splits the state space into a uniformly transient set and a disjoint union of absorbing rectangles, each supporting a unique invariant measure. The set of all invariant measures is the convex hull of these unique measures, and the SGD iterates converge geometrically to a mixture of them. Examples show that the diffusion approximation can mispredict the long-time behavior of SGD, that the global minimum may lie outside the support of the invariant measures, and that bifurcations can enable the iterates to transition between two local minima. Together, these results provide a rigorous framework for understanding SGD dynamics beyond diffusion models and suggest directions for extending the theory to broader function classes and dynamical regimes.

Abstract

Stochastic gradient descent (SGD) is a popular algorithm for minimizing objective functions that arise in machine learning. For constant step-size SGD, the iterates form a Markov chain on a general state space. Focusing on a class of separable (non-convex) objective functions, we establish a "Doeblin-type decomposition," in that the state space decomposes into a uniformly transient set and a disjoint union of absorbing sets. Each of the absorbing sets contains a unique invariant measure, with the set of all invariant measures being their convex hull. Moreover, the set of invariant measures is shown to be a global attractor for the Markov chain with a geometric convergence rate. The theory is highlighted with examples that show: (1) the failure of the diffusion approximation to characterize the long-time dynamics of SGD; (2) the global minimum of an objective function may lie outside the support of the invariant measures (i.e., even if initialized at the global minimum, SGD iterates will leave); and (3) bifurcations may enable the SGD iterates to transition between two local minima. Key ingredients in the theory involve viewing the SGD dynamics as a monotone iterated function system and establishing a "splitting condition" of Dubins and Freedman 1966 and Bhattacharya and Lee 1988.
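
The view of constant step-size SGD as a random iteration of monotone maps can be illustrated with a minimal sketch. Everything specific here is an assumption chosen for illustration and is not the paper's model problem: the two component functions f(x) = (x^2 - 1)^2 / 4 ± a x (whose average is a double well), the values a = 0.1 and eta = 0.1, and the helper names `grad` and `sgd_path`. For these values each map x -> x - eta * f'(x) is increasing on the region visited, so an interval between the fixed points of the two maps is absorbing, and a chain started in one well stays there.

```python
import random

def grad(x, sign, a):
    # Derivative of the hypothetical component f(x) = (x**2 - 1)**2 / 4 + sign * a * x.
    return x**3 - x + sign * a

def sgd_path(a, eta, x0, n_steps, seed=0):
    # Constant step-size SGD: at each step pick one of the two components
    # uniformly at random and apply the map x -> x - eta * f'(x).
    rng = random.Random(seed)
    x = x0
    path = [x]
    for _ in range(n_steps):
        sign = rng.choice((1.0, -1.0))
        x = x - eta * grad(x, sign, a)
        path.append(x)
    return path

# Two chains started in different wells of the averaged objective (x**2 - 1)**2 / 4.
right = sgd_path(a=0.1, eta=0.1, x0=1.0, n_steps=5000)
left = sgd_path(a=0.1, eta=0.1, x0=-1.0, n_steps=5000)

# Each chain remains inside its own interval, so this toy chain has
# (at least) two distinct absorbing sets.
print("right well range:", min(right), max(right))
print("left well range: ", min(left), max(left))
```

In this toy setup, increasing `a` until each component has a single minimum removes the barrier for one of the two maps, so the two absorbing intervals can merge into one. This is only loosely analogous to the bifurcation described in the abstract.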

Paper Structure (22 sections, 10 theorems, 158 equations, 14 figures)

Introduction
Background on SGD with Vanishing Lg
Background for SGD with Constant Lg
Background on Iterated Function Systems
Contributions and Organization of the Paper
Main Result
Assumptions and Problem Setting
Main Result
Examples in 1D
Diffusion Approximation Background
SGD on a One Dimensional Double Well
An Example where SGD Does Not Sample the Global Minimum
Mathematical Background
Markov Operators for Iterated Function Systems with Monotone Maps
A Few Lemmas for One Dimensional SGD
...and 7 more sections

Key Result

Theorem 2.2

Given the Markov chain Eq:MarkovOperatorDynamics and Eq:SGD_Markov_Operators corresponding to the SGD dynamics F--Eq:SGDIterates, assume A1--A5 hold. Let $I$ and $T_{\mathbf{m}}$ be defined as in Def:Statespace and Eq:DefT_Rd, and let $\eta$ be any value $0 < \eta < 1/K$

Figures (14)

Figure 1: Sketch of the rectangles $T_{\boldsymbol{m}}$.
Figure 1: Visualization of the SGD model problem given by \ref{F_splitting_ex}--\ref{f_1f_2} and \ref{DW} for values (Top) $\lambda = 0.55 > \lambda_c$, and (Bottom) $\lambda = 0.2 < \lambda_c$. When $\lambda > \lambda_c$, the SGD iterates can cross over the barrier of $F$ and there is a unique invariant measure. When $\lambda < \lambda_c$, the SGD iterates cannot cross over the barrier of $F$ and there are two invariant measures.
Figure 1: Visualization of conditions \ref{Con_DF1} and \ref{Path_cond}.
Figure 1: Visualization of two sub-cases for the proof in Step 1 of \ref{Lem:1d_Pathlength}.
Figure 2: Model problem \ref{DW}. Left: the critical points of the functions $f_1$ (black) and $f_2$ (blue) are plotted as a function of $\lambda$. Solid curves represent local minima and dashed curves represent local maxima. Middle: for each $\lambda$, the vertical cross section of the filled-in region represents the left moving set $L$. Right: for each $\lambda$, the vertical cross section of the filled-in region represents the right moving set $R$.
...and 9 more figures

Theorems & Definitions (22)

Definition 2.1: The sets $T_{m}$
Theorem 2.2: Main Result
Corollary 2.3: Unique invariant measure
Theorem 5.1: Dubins and Freedman 1966
Theorem 5.2: Bhattacharya and Lee 1988
Proposition 6.1: Basic properties of $L$ and $R$
Proof 1
Proposition 6.2: Properties of $T_{m}$
Proof 2
Proposition 6.3: Dynamics related to $L$ and $R$
...and 12 more
