Is Stochastic Gradient Descent Effective? A PDE Perspective on Machine Learning processes

Davide Barbieri; Matteo Bonforte; Peio Ibarrondo

TL;DR

This paper analyzes SGD through a degenerate Fokker-Planck framework derived from its SDE approximation and identifies two training regimes: a drift-dominated phase that concentrates parameters near local minima, and a diffusion-driven phase in which noise enables escapes from them, quantified by the mean exit time (MET). The authors develop two rigorous approaches to the long-time behavior: a duality-based approach using Noisy SGD to obtain stationary states in nondegenerate settings, and entropy methods (Bakry–Émery) to obtain convergence results in degenerate or near-degenerate regimes, including existence of invariant measures and exponential convergence under suitable conditions. The work provides quantitative MET bounds, concentration estimates, and a structured view of how parameter distributions evolve toward steady states, and it raises open questions about the global behavior and mass splitting of SGD trajectories. By bridging stochastic optimization and PDE theory, the paper clarifies SGD's exploration–exploitation dynamics and informs the practical understanding of training in overparameterized models.
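
For orientation, the objects named above can be written out as follows. This is a sketch in the normalization commonly used in the SDE-approximation literature, not a transcription of the paper's equations: the symbols $\eta$ (learning rate), $\gamma_k$ (minibatch index), $\Sigma$ (gradient-noise covariance), and the scaling $Q=\eta\Sigma$ are assumptions made here and may differ from the paper's conventions.

```latex
% SGD iteration with learning rate \eta and minibatch loss L_{\gamma_k}
x_{k+1} = x_k - \eta\,\nabla L_{\gamma_k}(x_k)

% SDE approximation (W_t is a standard Brownian motion)
dX_t = -\nabla L(X_t)\,dt + \sqrt{\eta}\,\Sigma(X_t)^{1/2}\,dW_t

% Fokker-Planck equation for the law \rho(t,x) of X_t, with Q = \eta\,\Sigma
\partial_t \rho
  = \nabla\cdot\bigl(\rho\,\nabla L\bigr)
  + \tfrac{1}{2}\sum_{i,j=1}^{d} \partial_{x_i}\partial_{x_j}\bigl(Q_{ij}\,\rho\bigr)
```

The equation is called degenerate when $Q(x)$ is only positive semidefinite, so that diffusion is absent in some directions.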

Abstract

In this paper we analyze the behaviour of stochastic gradient descent (SGD), a widely used method in supervised learning for optimizing neural network weights by minimizing non-convex loss functions. Since the pioneering work of E, Li and Tai (2017), the underlying structure of such processes can be understood via parabolic PDEs of Fokker-Planck type, which are at the core of our analysis. Even though Fokker-Planck equations have a long history and an extensive literature, almost nothing is known when the potential is non-convex or when the diffusion matrix is degenerate, and this is the main difficulty that we face in our analysis. We identify two different regimes. In the initial phase of SGD, the loss function drives the weights to concentrate around the nearest local minimum. We refer to this phase as the drift regime and we provide quantitative estimates on this concentration phenomenon. Next, we introduce the diffusion regime, where stochastic fluctuations help the learning process to escape suboptimal local minima. We analyze the Mean Exit Time (MET) and prove upper and lower bounds on it. Finally, we address the asymptotic convergence of SGD for a non-convex cost function and a degenerate diffusion matrix, a setting that rules out the standard approaches and requires new techniques. For this purpose, we exploit two different methods: duality and entropy methods. We provide new results about the dynamics and effectiveness of SGD, offering a deep connection between stochastic optimization and PDE theory, and some answers and insights to basic questions about Machine Learning processes: How long does SGD take to escape from a bad minimum? Do neural network parameters converge using SGD? How do parameters evolve in the first stage of training with SGD?
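
The question of how long SGD takes to escape a bad minimum can be illustrated numerically. The sketch below is not taken from the paper: it assumes a one-dimensional double-well loss $L(x)=(x^2-1)^2/4$, constant isotropic noise of strength `eps`, and an Euler-Maruyama discretization of $dX=-L'(X)\,dt+\sqrt{2\,\mathrm{eps}}\,dW$. The names `grad_loss` and `mean_exit_time` are hypothetical helpers introduced for this example.

```python
import numpy as np


def grad_loss(x):
    # Gradient of the assumed double-well loss L(x) = (x**2 - 1)**2 / 4,
    # which has minima at x = -1 and x = +1 and a barrier at x = 0.
    return x * (x**2 - 1.0)


def mean_exit_time(eps, n_paths=200, dt=0.01, max_steps=100_000, seed=0):
    """Monte Carlo estimate of the mean time to leave the left well (x < 0).

    Every path starts at the local minimum x = -1 and follows
    dX = -L'(X) dt + sqrt(2 * eps) dW (Euler-Maruyama scheme).
    Paths that have not exited after max_steps are counted at the final
    time, in which case the estimate is only a lower bound.
    """
    rng = np.random.default_rng(seed)
    x = np.full(n_paths, -1.0)
    exit_time = np.full(n_paths, max_steps * dt)
    alive = np.ones(n_paths, dtype=bool)
    for step in range(1, max_steps + 1):
        noise = rng.standard_normal(n_paths)
        proposal = x - grad_loss(x) * dt + np.sqrt(2.0 * eps * dt) * noise
        x = np.where(alive, proposal, x)  # exited paths stay frozen
        exited = alive & (x >= 0.0)       # barrier top reached
        exit_time[exited] = step * dt
        alive &= ~exited
        if not alive.any():
            break
    return exit_time.mean()


met_low_noise = mean_exit_time(eps=0.1)
met_high_noise = mean_exit_time(eps=0.25)
print(f"eps=0.10: estimated MET = {met_low_noise:.1f}")
print(f"eps=0.25: estimated MET = {met_high_noise:.1f}")
```

Under these assumptions the estimated exit time grows quickly as the noise strength decreases, which is the qualitative behaviour that upper and lower MET bounds are meant to quantify.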

Paper Structure (15 sections, 29 theorems, 256 equations)

Introduction
Main results
Related work on the SGD and the SDE approximation
Analysis in the drift regime
Local mass concentration
Analysis in the diffusion regime
Asymptotic behaviour of the SGD
Duality method for Noisy SGD
Existence of steady states: proof of Theorem 1.5
The question of convergence to stationary measures
Entropy method
Some conclusions and open questions
Approximation of the Noisy SGD
Deduction of the Mean Exit Time problem
Kramers' Law

Key Result

Theorem 1.2

Assume that $L$ is $\lambda$-convex in $B_{(1+\delta)R_0}(0)$ with a minimum at $0$ and $\lambda>0$. Let $\rho$ be a weak solution of the Fokker-Planck equation with $0\le Q(x)\le \sigma I_{d\times d}$ for every $x\in B_{(1+\delta)R_0}(0)$. Let us consider $\varphi(t,r):[t_0,\infty)\times\mathbb{R}_+\rightarrow\ldots$ [...] Then, given any $a>0$ and $\alpha,\beta\in(0,1)$, there exists an $\varepsilon_0>0$ such that for e...
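
The concentration phenomenon behind this statement can be illustrated with a small particle simulation. This is a sketch under assumptions chosen here, not the paper's estimate or proof: a quadratic loss $L(x)=\lambda|x|^2/2$ in two dimensions, a degenerate diffusion matrix $Q=\mathrm{diag}(\sigma,0)$ (which satisfies $0\le Q\le\sigma I$), and the SDE $dX=-\nabla L(X)\,dt+Q^{1/2}\,dW$ discretized by Euler-Maruyama. The function names and all parameter values are hypothetical.

```python
import numpy as np


def mass_in_ball(points, radius):
    # Fraction of particles whose Euclidean norm is below `radius`.
    return float(np.mean(np.linalg.norm(points, axis=1) < radius))


def simulate_concentration(lam=1.0, sigma=0.02, radius=0.3,
                           n_particles=2000, dt=0.01, t_final=5.0, seed=0):
    """Return the mass inside B_radius(0) at time 0 and at time t_final."""
    rng = np.random.default_rng(seed)
    # Initial particles: uniform in the unit disk.
    angles = rng.uniform(0.0, 2.0 * np.pi, n_particles)
    radii = np.sqrt(rng.uniform(0.0, 1.0, n_particles))
    x = np.column_stack((radii * np.cos(angles), radii * np.sin(angles)))
    # Square root of the assumed degenerate matrix Q = diag(sigma, 0):
    # noise acts on the first coordinate only.
    q_sqrt = np.array([np.sqrt(sigma), 0.0])
    mass_start = mass_in_ball(x, radius)
    for _ in range(int(round(t_final / dt))):
        noise = rng.standard_normal(x.shape)
        # Drift -grad L(x) = -lam * x pulls particles toward the minimum.
        x = x - lam * x * dt + np.sqrt(dt) * q_sqrt * noise
    return mass_start, mass_in_ball(x, radius)


mass_start, mass_end = simulate_concentration()
print(f"mass in ball at t=0:   {mass_start:.3f}")
print(f"mass in ball at t=5.0: {mass_end:.3f}")
```

In this toy setting the drift moves most of the initial mass into a small ball around the minimum while the bounded noise keeps it from collapsing to a point, which is the qualitative picture of the drift regime.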

Theorems & Definitions (49)

Definition 1.1: Weak solutions.
Theorem 1.2: Local mass concentration
Theorem 1.3: Lower bound for MET
Theorem 1.4: Upper bound for MET
Remark: About the condition for the upper bounds on the MET
Theorem 1.5: Existence of steady states
Theorem 1.6: Convergence in the Non-Hörmander case
Lemma 2.1
proof
Corollary 2.2
...and 39 more
