Is Stochastic Gradient Descent Effective? A PDE Perspective on Machine Learning processes
Davide Barbieri, Matteo Bonforte, Peio Ibarrondo
TL;DR
This paper analyzes SGD through a degenerate Fokker-Planck framework derived from its SDE approximation, revealing two training regimes: a drift-dominated phase that concentrates parameters near local minima, and a diffusion-driven phase that enables escapes from them, quantified through mean exit times (MET). The authors develop two rigorous approaches to long-time behavior: a duality-based approach that uses Noisy SGD to obtain stationary states in nondegenerate settings, and entropy methods (Bakry–Émery) that yield convergence results in degenerate or near-degenerate regimes, including existence of invariant measures and exponential convergence under suitable conditions. The work provides quantitative MET bounds, concentration estimates, and a structured view of how parameter distributions evolve toward steady states, offering both theoretical insights and open questions about the global behavior and mass splitting of SGD trajectories. By bridging stochastic optimization and PDE theory, the paper clarifies SGD’s exploration–exploitation dynamics and informs practical understanding of training dynamics in overparameterized models.
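For orientation, the SDE approximation and the associated Fokker-Planck equation are commonly written as below. This is a sketch in assumed notation, not taken from the paper: f is the loss, \eta the learning rate, \Sigma(\theta) the covariance of the stochastic gradient, and \rho(t,\theta) the density of the parameters. "Degenerate" then means that \Sigma(\theta) may fail to be uniformly positive definite.

```latex
% SDE approximation of SGD (assumed notation)
d\Theta_t = -\nabla f(\Theta_t)\,dt + \sqrt{\eta\,\Sigma(\Theta_t)}\;dW_t

% Fokker-Planck equation for the law \rho(t,\theta) of \Theta_t
\partial_t \rho
  = \nabla\cdot\bigl(\rho\,\nabla f\bigr)
  + \frac{\eta}{2}\sum_{i,j=1}^{d}
    \partial_{\theta_i}\partial_{\theta_j}\bigl(\Sigma_{ij}(\theta)\,\rho\bigr)
```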
Abstract
In this paper we analyze the behaviour of stochastic gradient descent (SGD), a widely used method in supervised learning for optimizing neural network weights by minimizing non-convex loss functions. Since the pioneering work of E, Li and Tai (2017), the underlying structure of such processes can be understood via parabolic PDEs of Fokker-Planck type, which are at the core of our analysis. Even though Fokker-Planck equations have a long history and an extensive literature, almost nothing is known when the potential is non-convex or when the diffusion matrix is degenerate, and this is the main difficulty that we face in our analysis. We identify two different regimes. In the initial phase of SGD, the loss function drives the weights to concentrate around the nearest local minimum. We refer to this phase as the drift regime and we provide quantitative estimates on this concentration phenomenon. Next, we introduce the diffusion regime, where stochastic fluctuations help the learning process to escape suboptimal local minima. We analyze the Mean Exit Time (MET) and prove upper and lower bounds for it. Finally, we address the asymptotic convergence of SGD for a non-convex cost function and a degenerate diffusion matrix, a setting that rules out the standard approaches and requires new techniques. For this purpose, we exploit two different methods: duality and entropy methods. We provide new results about the dynamics and effectiveness of SGD, offering a deep connection between stochastic optimization and PDE theory, and some answers and insights to basic questions about Machine Learning processes: How long does SGD take to escape from a bad minimum? Do neural network parameters converge using SGD? How do parameters evolve in the first stage of training with SGD?
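The two regimes described above can be illustrated numerically with a toy model. The sketch below is not the paper's setting: it assumes a one-dimensional double-well loss f(x) = (x^2 - 1)^2 / 4, constant isotropic (nondegenerate) noise, and an Euler-Maruyama discretization of dX = -f'(X) dt + sqrt(2 * noise) dW. Paths start at the local minimum x = -1 and "exit" when they first reach the barrier top at x = 0. The function names, the potential, and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np


def grad_loss(x):
    """Gradient of the assumed toy loss f(x) = (x**2 - 1)**2 / 4."""
    return x * (x**2 - 1.0)


def mean_exit_time(noise, n_paths=200, dt=1e-2, max_steps=200_000, seed=0):
    """Monte Carlo estimate of the mean exit time from the left well.

    Simulates dX = -f'(X) dt + sqrt(2 * noise) dW with Euler-Maruyama,
    starting at x = -1 and stopping each path when it first reaches x >= 0.
    Returns (estimated mean exit time, number of paths that never exited).
    """
    rng = np.random.default_rng(seed)
    x = np.full(n_paths, -1.0)
    exit_time = np.full(n_paths, np.nan)
    alive = np.ones(n_paths, dtype=bool)

    for step in range(1, max_steps + 1):
        n_alive = int(alive.sum())
        if n_alive == 0:
            break
        xa = x[alive]
        # Drift step (gradient descent) plus diffusion step (Gaussian noise).
        xa = xa - grad_loss(xa) * dt \
            + np.sqrt(2.0 * noise * dt) * rng.standard_normal(n_alive)
        x[alive] = xa
        # Record the first time each still-running path crosses the barrier.
        exited = alive & (x >= 0.0)
        exit_time[exited] = step * dt
        alive &= ~exited

    return float(np.nanmean(exit_time)), int(alive.sum())


if __name__ == "__main__":
    for noise in (0.25, 0.1):
        met, stuck = mean_exit_time(noise)
        print(f"noise={noise}: estimated MET={met:.2f}, unexited paths={stuck}")
```

In this toy model, lowering the noise level should lengthen the estimated exit time, which matches the qualitative picture of noise helping the process leave a suboptimal minimum. The estimate is subject to Monte Carlo and time-discretization error and says nothing about the degenerate case treated in the paper.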
