Stochastic Gradient Flow Dynamics of Test Risk and its Exact Solution for Weak Features

Rodrigo Veiga, Anastasia Remizova, Nicolas Macris

TL;DR

This work analyzes the test risk dynamics of stochastic gradient flow (SGF) in the small-learning-rate regime by recasting SGD as a continuous-time Itô process and deploying a path-integral (Laplace) approximation. A general covariance formula for fluctuations around the pure gradient flow trajectory is derived, enabling an explicit comparison between GF and SGF test risks. The theory is then applied to a weak-features regression model that exhibits double descent, yielding closed-form expressions in terms of Marchenko–Pastur spectra and time integrals; the SGF corrections correctly predict deviations from GF seen in SGD simulations. Overall, the paper provides a tractable, analytically controlled framework to quantify how stochasticity in SGD reshapes generalization dynamics over the entire training horizon, with potential extensions to more complex models and activation schemes. The results offer a principled basis for understanding stochastic effects on double-descent curves and the time-dependent generalization behavior in overparameterized settings.
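The recasting of SGD as an Itô process can be sketched schematically as follows. The symbols $L$ (training loss), $\hat{L}_k$ (mini-batch loss at step $k$), $\Sigma$ (gradient-noise covariance), $\bm{\xi}_t$ (rescaled fluctuation) and $R$ (test risk) are assumed here for illustration; the paper's own notation, normalization and time rescaling may differ.

```latex
% Assumed notation, not taken from the paper.
% One SGD step with learning rate \gamma on a mini-batch loss \hat{L}_k:
\bm{w}_{k+1} = \bm{w}_k - \gamma \nabla \hat{L}_k(\bm{w}_k)

% Standard continuous-time surrogate (time t = k\gamma, \bm{B}_t a Brownian motion):
d\bm{w}_t = -\nabla L(\bm{w}_t)\, dt + \sqrt{\gamma}\, \Sigma(\bm{w}_t)^{1/2}\, d\bm{B}_t

% Small-\gamma expansion around the gradient flow solution \bm{w}^{\text{ode}}:
\bm{w}_t = \bm{w}^{\text{ode}}(t) + \sqrt{\gamma}\, \bm{\xi}_t + \mathcal{O}(\gamma)

% Second-order Taylor expansion of the test risk, valid when the mean
% fluctuation vanishes at this order (e.g. linear drift, quadratic risk):
\mathbb{E}\, R(\bm{w}_t) - R(\bm{w}^{\text{ode}}(t))
  \approx \frac{\gamma}{2}\, \mathrm{Tr}\!\left[ \nabla^2 R(\bm{w}^{\text{ode}}(t))\, \mathrm{Cov}(\bm{\xi}_t) \right]
```

In this reading, the covariance of the fluctuations around the gradient flow trajectory is the quantity that controls the gap between the SGF and GF test risks.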

Abstract

We investigate the test risk of continuous-time stochastic gradient flow dynamics in learning theory. Using a path integral formulation we provide, in the regime of a small learning rate, a general formula for computing the difference between test risk curves of pure gradient and stochastic gradient flows. We apply the general theory to a simple model of weak features, which displays the double descent phenomenon, and explicitly compute the corrections brought about by the added stochastic term in the dynamics, as a function of time and model parameters. The analytical results are compared to simulations of discrete-time stochastic gradient descent and show good agreement.
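The kind of comparison described above, discrete-time SGD against full-batch GD at a small learning rate, can be sketched numerically. The example below is only an illustration on a generic noisy least-squares problem, not the paper's weak-features model: the function name `compare_gd_sgd`, the data model, the noise level and all parameter values are assumptions made for this sketch.

```python
import numpy as np

def compare_gd_sgd(n=40, d=20, lr=1e-2, steps=2000, seed=0):
    """Run full-batch GD and single-sample SGD from the same start.

    Hypothetical toy setup (not the paper's model): y = x . beta + noise,
    with inputs x ~ N(0, I/d). Returns (initial, GD, SGD) test risks.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d)) / np.sqrt(d)
    beta = rng.standard_normal(d)
    beta /= np.linalg.norm(beta)              # teacher vector on the unit sphere
    y = X @ beta + 0.1 * rng.standard_normal(n)

    def test_risk(w):
        # Excess risk E[(x.w - x.beta)^2] for x ~ N(0, I/d) is ||w - beta||^2 / d.
        return float(np.sum((w - beta) ** 2) / d)

    w_gd = np.zeros(d)
    w_sgd = np.zeros(d)
    risk_init = test_risk(w_gd)
    for _ in range(steps):
        # Full-batch gradient of the training loss (1/2n) * sum_i (x_i.w - y_i)^2.
        w_gd -= lr * (X.T @ (X @ w_gd - y)) / n
        # Unbiased single-sample estimate of the same gradient.
        i = rng.integers(n)
        w_sgd -= lr * X[i] * (X[i] @ w_sgd - y[i])
    return risk_init, test_risk(w_gd), test_risk(w_sgd)

if __name__ == "__main__":
    r0, r_gd, r_sgd = compare_gd_sgd()
    print(f"initial: {r0:.5f}  GD: {r_gd:.5f}  SGD: {r_sgd:.5f}")
    print(f"SGD - GD test risk difference: {r_sgd - r_gd:.2e}")
```

A single SGD run is noisy, so a meaningful estimate of the SGD minus GD difference would require averaging over many independent runs, as is done in the simulations reported in the figures.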

Paper Structure (39 sections, 212 equations, 7 figures)

Figures (7)

  • Figure 1: The solution $\bm{w}^{\text{ode}}(\tau)$ of the first-order ODE $\dot{\bm{w}}(\tau) = \bm{f} (\tau, \bm{w}(\tau))$, $\bm{w}(0)= \bm{w}_0$, minimizes the action, which vanishes along this trajectory. The dominant contributions to the path integral for small $\gamma$ are fluctuations of order $\mathcal{O} (\sqrt\gamma)$ around this trajectory.
  • Figure 2: Continuous curve: the difference between SGF and GF test risks for $t\to + \infty$ according to Eq. \ref{theory-pred-1}. Dots: difference of SGD and GD test risks obtained from numerical simulations (averaged over $1000$ different random subsets $\mathcal{A}$). Simulation parameters: $d=1000$, $n=400$, such that $\psi = 2.5$, and $\gamma= 10^{-3}$ (hence $\gamma'=1$). Vectors $\bm{\beta}$, $\bm{\beta}_0$ are taken at random on the unit $1000$-dimensional sphere, and here $\norm{\bm{\beta}-\bm{\beta}_0}^2 \approx 2.11$.
  • Figure 3: Continuous curves: difference between the test risks of SGF and GF for various times $t\in [10^{-3}, 10^3]$ according to Eq. \ref{assympt-2}. Dots: difference between the test risks of SGD and GD obtained from numerical simulations with the same parameters as in Fig. \ref{fig:cov_gausmodel}. The asymptotic analytical theory underestimates the SGD quantities at intermediate times $1\lessapprox t\lessapprox 10$ but works well at small and large times.
  • Figure 4: Theoretical prediction for the difference between test risk of SGF and GF in the $(t, \alpha)$ plane. Here we fix $\psi =2.5$.
  • Figure 5: Plots depicting $\mathcal{E}_{\mathrm{test}}^{\text{GF}}(t)$ for $\psi = 2.5$. Solid lines represent the theoretical asymptotic estimate. Markers in \ref{fig:etestode_b} represent the results of numerical simulations of GD (averaged over $1000$ different random subsets $\mathcal{A}$). Simulation parameters: $d=1000$, $\gamma= 10^{-3}$; vectors $\bm{\beta}$, $\bm{\beta}_0$ are taken at random on the unit $1000$-dimensional sphere, and here $\norm{\bm{\beta}-\bm{\beta}_0}^2 \approx 2.11$.
  • ...and 2 more figures