
Two Facets of SDE Under an Information-Theoretic Lens: Generalization of SGD via Training Trajectories and via Terminal States

Ziqiao Wang, Yongyi Mao

TL;DR

This work models SGD with a Gaussian-gradient-noise SDE to enable information-theoretic generalization analysis. It develops two complementary bounds: a trajectory-based bound that sums mutual-information terms along training trajectories and a terminal-state bound based on the stationary distribution around local minima, each capturing distinct aspects of SGD generalization. Empirically, the trajectory bound is tighter than prior results, while the terminal-state bound achieves fast decay rates comparable to stability-based bounds; both bounds align well with SGD dynamics and gradient-noise structure. Overall, the approach provides a practical, data- and algorithm-dependent framework for understanding and predicting SGD generalization through its SDE surrogate.

Abstract

Stochastic differential equations (SDEs) have recently been shown to characterize well the dynamics of training machine learning models with SGD. When the generalization error of the SDE approximation closely aligns with that of SGD in expectation, this provides two opportunities for better understanding the generalization behaviour of SGD through its SDE approximation. First, viewing SGD as full-batch gradient descent with Gaussian gradient noise allows us to obtain a trajectory-based generalization bound using the information-theoretic bound from Xu and Raginsky [2017]. Second, assuming mild conditions, we estimate the steady-state weight distribution of the SDE and use information-theoretic bounds from Xu and Raginsky [2017] and Negrea et al. [2019] to establish terminal-state-based generalization bounds. Our proposed bounds have some advantages: notably, the trajectory-based bound outperforms results in Wang and Mao [2022], and the terminal-state-based bound exhibits a fast decay rate comparable to stability-based bounds.
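The view of SGD as full-batch gradient descent with Gaussian gradient noise can be illustrated with a minimal sketch on a toy least-squares problem. Everything here is an assumption made for illustration, not the paper's implementation: the function names (`per_sample_grads`, `gaussian_noise_step`, `run_demo`), the toy data, and the choice of noise covariance, which is taken to be the covariance of a size-`batch_size` minibatch mean sampled without replacement.

```python
import numpy as np


def per_sample_grads(w, X, y):
    """Per-sample gradients of the squared loss 0.5 * (x @ w - y)**2, shape (n, d)."""
    residual = X @ w - y
    return residual[:, None] * X


def gaussian_noise_step(w, X, y, lr, batch_size, rng):
    """One full-batch gradient step plus Gaussian noise matched to minibatch sampling."""
    n = X.shape[0]
    grads = per_sample_grads(w, X, y)
    full_grad = grads.mean(axis=0)

    # Covariance of a minibatch-mean gradient drawn without replacement
    # (finite-population correction). Assumed form, chosen for illustration.
    centered = grads - full_grad
    sigma = centered.T @ centered / n
    cov = (n - batch_size) / (batch_size * (n - 1)) * sigma

    # Symmetric square root via eigendecomposition; clip tiny negative eigenvalues.
    evals, evecs = np.linalg.eigh(cov)
    sqrt_cov = (evecs * np.sqrt(np.clip(evals, 0.0, None))) @ evecs.T

    noise = sqrt_cov @ rng.standard_normal(w.shape[0])
    return w - lr * full_grad + lr * noise


def mean_loss(w, X, y):
    """Average squared loss over the whole training set."""
    return 0.5 * np.mean((X @ w - y) ** 2)


def run_demo(steps=200, lr=0.05, batch_size=8, seed=0):
    """Run the noisy full-batch iteration on synthetic data; return (initial, final) loss."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((64, 3))
    w_true = np.array([1.0, -2.0, 0.5])  # hypothetical ground-truth weights
    y = X @ w_true + 0.1 * rng.standard_normal(64)
    w = np.zeros(3)
    initial = mean_loss(w, X, y)
    for _ in range(steps):
        w = gaussian_noise_step(w, X, y, lr, batch_size, rng)
    return initial, mean_loss(w, X, y)
```

Calling `run_demo()` returns the training loss before and after the noisy iteration. Setting `batch_size` equal to the number of samples makes the noise covariance vanish, so the step reduces to plain full-batch gradient descent.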

Paper Structure

This paper contains 40 sections, 24 theorems, 71 equations, 7 figures, and 1 table.

Key Result

Lemma 2.1

Assume the loss $\ell(w,Z)$ is $R$-subgaussian for any $w\in\mathcal{W}$ (a random variable $X$ is $R$-subgaussian if, for any $\rho\in\mathbb{R}$, $\log \mathbb{E}\exp\left(\rho\left(X-\mathbb{E}X\right)\right) \le \rho^2R^2/2$; note that a bounded loss is guaranteed to be subgaussian), then
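The remark that a bounded loss is subgaussian can be checked numerically in a small case. The sketch below assumes a loss taking values in $\{0,1\}$ (a Bernoulli variable with parameter `p`), for which Hoeffding's lemma gives $R = 1/2$. The function names and the grid of $\rho$ values are illustrative choices, and checking a finite grid is a sanity check rather than a proof.

```python
import numpy as np


def log_mgf_centered_bernoulli(rho, p):
    """log E exp(rho * (X - E X)) for X ~ Bernoulli(p), in closed form."""
    return np.log(p * np.exp(rho * (1.0 - p)) + (1.0 - p) * np.exp(-rho * p))


def is_subgaussian_on_grid(p, R, rhos):
    """Check the R-subgaussian inequality at every rho in the grid."""
    lhs = log_mgf_centered_bernoulli(rhos, p)
    rhs = rhos ** 2 * R ** 2 / 2.0
    # Small tolerance absorbs floating-point rounding near rho = 0.
    return bool(np.all(lhs <= rhs + 1e-12))


rhos = np.linspace(-20.0, 20.0, 2001)
results = {p: is_subgaussian_on_grid(p, 0.5, rhos) for p in (0.1, 0.3, 0.5, 0.9)}
```

With $R = 1/2$ the inequality holds on the grid for each tested `p`. A smaller constant such as $R = 0.4$ fails at `p = 0.5`, where the variance of the loss is largest.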

Figures (7)

  • Figure 1: Performance of VGG-11 and ResNet-18 trained with SGD and SDE.
  • Figure 2: Gradient-related quantities of SGD or its discrete SDE approximation. In (d), since the per-sample gradient is ill-defined when batch normalization is used, we do not track $\mathrm{tr}\left\{\log\left(\Sigma^{-1}_t\Sigma^{\mu}_t\right)\right\}$.
  • Figure 3: Hessian-related quantities of SGD or its discrete SDE approximation.
  • Figure 4: (a-b,e-f) The dynamics of $\eta/2-\lambda_1$. Note that the learning rate decays by $0.1$ at the $40,000^{\rm th}$ and the $60,000^{\rm th}$ iteration. (c-d,g-h) The distance of the current model parameters from their initialization.
  • Figure 5: Estimated trajectory-based bound and terminal-state-based bound, with $R$ excluded. Zoomed-in figures of the generalization error are given in the Appendix.
  • ...and 2 more figures

Theorems & Definitions (46)

  • Lemma 2.1 (Xu and Raginsky [2017])
  • Lemma 2.2 (Negrea et al. [2019])
  • Lemma 3.1
  • Lemma 3.2
  • Theorem 3.1
  • Corollary 3.1
  • Theorem 3.2
  • Remark 3.1
  • Lemma 3.3
  • Theorem 4.1
  • ...and 36 more