
On the different regimes of Stochastic Gradient Descent

Antonio Sclocchi, Matthieu Wyart

TL;DR

The paper analyzes how stochastic gradient descent behaves under varying batch sizes and learning rates, revealing a three-regime phase diagram (noise-dominated SGD, first-step-dominated SGD, and gradient descent) whose boundaries depend on training set size $P$ and task hardness. By modeling SGD as an online stochastic differential equation and applying it to a teacher–student perceptron with hinge loss, the authors derive precise scalings: the critical batch size $B^*$ scales as $B^*\sim P^{\gamma}$ with $\gamma=1/(1+\chi)$, and the end-of-training weight norms follow $\|\mathbf w\|\sim TP^{\gamma}$ in the noise-dominated regime. They validate the theory on deep networks (fully-connected and CNNs) across MNIST and CIFAR-10, showing the phase diagram persists and that $B^*$ grows with $P$ according to dataset hardness. Extensions to momentum and weight decay are discussed, along with large-margin and small-margin regimes, NTK-related lazy behavior, and mechanisms to adapt generalization to data structure, making the framework practically relevant for choosing SGD hyperparameters in real-world training. Overall, the work links SGD noise, data size, and task difficulty to dynamical regimes and generalization, offering actionable guidance for optimizing training in deep learning.
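
The scaling $B^*\sim P^{\gamma}$ with $\gamma=1/(1+\chi)$ quoted above can be turned into a small calculator. The following is a minimal sketch, not the authors' code: the function names and the `prefactor` argument are hypothetical, and since a scaling relation fixes only the exponent, the prefactor would have to be fitted to data.

```python
def hardness_exponent(chi: float) -> float:
    """Exponent gamma = 1 / (1 + chi) quoted in the summary."""
    if 1.0 + chi <= 0.0:
        # Guard added for this sketch: avoids division by zero / negative base.
        raise ValueError("this sketch requires 1 + chi > 0")
    return 1.0 / (1.0 + chi)


def critical_batch_size(P: int, chi: float, prefactor: float = 1.0) -> float:
    """Scaling estimate B* ~ prefactor * P**gamma.

    `prefactor` is an ASSUMPTION: the scaling law does not determine it.
    """
    return prefactor * P ** hardness_exponent(chi)


def critical_batch_ratio(P_new: int, P_old: int, chi: float) -> float:
    """How much B* grows when the training set grows from P_old to P_new.

    The unknown prefactor cancels in this ratio.
    """
    return (P_new / P_old) ** hardness_exponent(chi)
```

With $\chi=1$ the exponent is $1/2$, so quadrupling $P$ doubles the estimated $B^*$; the ratio form is the safer quantity to use because it does not depend on the unknown prefactor.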

Abstract

Modern deep networks are trained with stochastic gradient descent (SGD), whose key hyperparameters are the number of data points considered at each step, or batch size $B$, and the step size, or learning rate $\eta$. For small $B$ and large $\eta$, SGD corresponds to a stochastic evolution of the parameters, whose noise amplitude is governed by the "temperature" $T\equiv \eta/B$. Yet this description is observed to break down for sufficiently large batches $B\geq B^*$, or simplifies to gradient descent (GD) when the temperature is sufficiently small. Understanding where these cross-overs take place remains a central challenge. Here, we resolve these questions for a teacher-student perceptron classification model and show empirically that our key predictions still apply to deep networks. Specifically, we obtain a phase diagram in the $B$-$\eta$ plane that separates three dynamical phases: (i) a noise-dominated SGD governed by temperature, (ii) a large-first-step-dominated SGD and (iii) GD. These different phases also correspond to different regimes of generalization error. Remarkably, our analysis reveals that the batch size $B^*$ separating regimes (i) and (ii) scales with the size $P$ of the training set, with an exponent that characterizes the hardness of the classification problem.
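
To make the roles of $B$, $\eta$ and $T=\eta/B$ concrete, here is a minimal sketch of minibatch SGD on a teacher-student perceptron with hinge loss. It is an illustration under stated assumptions, not the paper's experimental code: the inputs are plain Gaussian (the paper's $\chi$-dependent data distribution is not reproduced here), the teacher is taken along the first coordinate, and all function names are hypothetical.

```python
import numpy as np


def temperature(eta, B):
    """SGD 'temperature' T = eta / B, as defined in the abstract."""
    return eta / B


def make_teacher_data(P, d, rng):
    """ASSUMPTION: Gaussian inputs, teacher along the first axis, y = sign(x_1)."""
    X = rng.standard_normal((P, d))
    y = np.where(X[:, 0] >= 0, 1.0, -1.0)
    return X, y


def hinge_loss(w, X, y, kappa):
    """Mean hinge loss with margin kappa for the linear output f(x) = w . x."""
    return np.maximum(kappa - y * (X @ w), 0.0).mean()


def sgd_hinge(X, y, eta, B, kappa, max_steps=2000, seed=0):
    """Minibatch SGD from w = 0; stops at zero training loss or after max_steps."""
    rng = np.random.default_rng(seed)
    P, d = X.shape
    w = np.zeros(d)
    n_updates = 0
    while n_updates < max_steps and hinge_loss(w, X, y, kappa) > 0.0:
        idx = rng.choice(P, size=B, replace=False)  # one batch, no repeated samples
        Xb, yb = X[idx], y[idx]
        active = (kappa - yb * (Xb @ w)) > 0.0      # samples violating the margin
        # Gradient of the batch-averaged hinge loss: only active samples contribute.
        grad = -(yb[active][:, None] * Xb[active]).sum(axis=0) / B
        w -= eta * grad
        n_updates += 1
    return w, n_updates
```

One way to probe the noise-dominated regime with this sketch is to run pairs such as $(\eta, B) = (1, 4)$ and $(2, 8)$, which share the same `temperature`, and compare the resulting `w[0]` and the norm of the remaining components. The abstract predicts agreement only for $B$ below $B^*$, so large batches are expected to deviate.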

Paper Structure (46 sections, 96 equations, 15 figures)


Figures (15)

  • Figure 1: SGD phase diagrams for different data and architectures. The data sets considered are a teacher perceptron model with $P=8192$, dimension $d=128$ and the data distribution of Eq. \ref{eq:rho_x1} with $\chi=1$ (A.1), $P=32768$ images of MNIST (B.1) and $P=16384$ images of CIFAR-10 (C.1). The neural network architectures trained on these data sets are, respectively: (A.2) a perceptron, whose output is linear in both the input ${\mathbf{x}}$ and the weights ${\mathbf{w}}$, trained with hinge-loss margin $\kappa=2^{-7}$; (B.2) a fully-connected network with $5$ hidden layers, $128$ hidden neurons per layer and margin $\kappa=2^{-15}$; (C.2) a CNN made of several blocks composed of depth-wise, point-wise and standard convolutions plus residual connections (more details in \ref{app:expModels}), with margin $\kappa=2^{-15}$. Panels (A.3), (B.3), (C.3) display the alignment after training in the $\eta,B$ phase diagram. The black dots correspond to diverging trainings, where the algorithm does not converge. In the noise-dominated SGD regime, the alignment is constant along the diagonals $\frac{\eta}{B}=T$. Within first-step-dominated SGD, instead, the alignment is constant at constant $\eta$. For small $\eta$, one enters the gradient descent (GD) regime, where the alignment does not depend on $\eta$ and $B$. Taking this value $m_{GD}$ of the alignment as a reference, the black dashed line $\eta_c(B)$ delimiting the GD region corresponds to the alignment taking the value $2 m_{GD}$. The vertical black dashed line guides the eye to the critical batch size $B^*$. Panels (A.4), (B.4), (C.4) display the test error, again as a function of $\eta,B$. As expected, the test error is constant along the diagonals $\frac{\eta}{B}=T$ for noise-dominated SGD and constant in the GD regime. For first-step-dominated SGD, the test error can be affected by both $\eta$ and $B$, and can improve at very large batches (see discussion below).
  • Figure 2: Dynamical trajectories of SGD in the $(w_1,w_\perp)$ plane, at fixed $T$ and varying batch size as indicated in the legend. Black circles indicate the first step of SGD, black stars the last one. For small enough batches (and therefore small learning rates), trajectories converge to the online SDE solution (black dashed line). For large batches this no longer holds, and the final magnitude of the weights increases with batch size. The location of the stopping weights corresponds to zero loss, which can be approximately determined by measuring the hinge loss values $L_{train}(w_1,w_\perp)$ (shown in color) computed as a function of the perceptron weights ${\mathbf{w}} = w_1\mathbf{e}_{1} + w_\perp \bm{\xi}$. Here, $\bm{\xi}$ is a $(d-1)$-dimensional Gaussian random vector. The white area corresponds to interpolating solutions $L_{train}=0$ in this simplified set-up. For full batch, we observe that ${\mathbf{w}}$ can land directly in the white area and therefore fit the data within at most a few steps. This behavior affects the test error when $\eta$ is large (Fig. \ref{fig:phase}, panel A.4). Data correspond to $P=16384$, $d=128$, $\kappa=0.01$, $\chi=1$, $T=2$.
  • Figure 3: For large learning rates, the dynamics of the alignment differs between the small-batch and large-batch regimes. (Main panel) Perceptron: in this case, the alignment is proportional to the student component $w_1$; data for fixed $\eta=512$, same setting as Fig. \ref{fig:phase}. For small $B$, $w_1$ grows during the training dynamics, while for large $B$ its final value is reached after a single training step. (Inset) Fully-connected network on MNIST, small margin ($\kappa=2^{-15}$), fixed $\eta=16$, same setting as Fig. \ref{fig:phase}. For both small and large batches, the alignment dynamics is similar to the perceptron case, although for large batches it reaches its final value after a few training steps rather than a single one.
  • Figure 4: The critical batch size $B^*$ depends on the size of the training set as predicted by Eq. \ref{eq:BcP}. (a) $w_1$ at the end of training for the perceptron. Inset: $w_1$ depends on $\eta$, $B$ and $P$ for small $B$, while it only depends on $\eta$ for large $B$. We observe that the cross-over between small and large $B$ depends on $P$. Main panel: the curves collapse onto a single curve when $w_1$ is rescaled by $\eta$ and $B$ by $B^*\propto P^\gamma$, $\gamma=\frac{1}{1+\chi}$. This is consistent with $w_1\sim\frac{\eta}{B}P^\gamma$ for small $B$ (Eq. \ref{eq:tcross_wcross}) and $w_1\sim \eta$ for large $B$ (section \ref{sec:Bcrit}). (b) Fully-connected network on parity MNIST. Alignment $\langle y({\mathbf{x}}) f({\mathbf{x}})\rangle_{\mathbf{x}}$ at the end of training (measured as in Eq. \ref{eq:align}). Inset: as for $w_1$ in panel (a), $\langle y({\mathbf{x}}) f({\mathbf{x}})\rangle_{\mathbf{x}}$ depends on $\eta$, $B$ and $P$ for small $B$, and only on $\eta$ for large $B$. Main panel: rescaling $B$ by $P^{0.2}$ aligns the cross-over batch size at different values of $P$, suggesting a dependence $B^*\propto P^{0.2}$. The curves approximately collapse when the y-axis is rescaled as $\langle y({\mathbf{x}}) f({\mathbf{x}})\rangle_{\mathbf{x}}/\eta$.
  • Figure 5: (a) Dynamical trajectories of SGD with momentum (Sutskever et al., 2013; momentum coefficient $m=0.9$) in the $(w_1, w_\perp)$ plane at fixed $\eta/B=2$ and varying batch size as indicated in the legend. The setting is identical to Fig. 2 in the main text. For small batches ($B<512$), the final point of the dynamics is determined by the effective temperature $T_{eff}=\frac{\eta}{(1-m) B}$. In fact, with respect to Fig. 2, the final value of $w_\perp\propto T_{eff}$ is shifted upwards by a factor $1/(1-m)=10$. For large batches ($B \gtrsim 512$), the dynamics is dominated by the first steps. (b) Dynamics of the perceptron weights using SGD with weight decay, varying $\Lambda$, with fixed $B=8$, $\eta=16$, $\chi=1$, $d=128$, $P=16384$, $\kappa=2^{-7}$. $\|{\mathbf{w}}_\perp\|$ is unaffected by $\Lambda$ and converges to the same value, proportional to $T=\eta/B$. The growth of $w_1$, instead, is interrupted at a time scale that increases with decreasing $\Lambda$.
  • ...and 10 more figures
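
The effective temperature from the Figure 5 caption, $T_{eff}=\frac{\eta}{(1-m)B}$, lends itself to a short worked example. The sketch below is only arithmetic on that formula; the helper names are hypothetical, and `lr_for_temperature` is simply the formula solved for $\eta$, not a tuning rule stated in the paper. According to the caption, the relation describes the small-batch regime only.

```python
def effective_temperature(eta, B, m=0.0):
    """T_eff = eta / ((1 - m) * B); reduces to T = eta / B when m = 0."""
    if not 0.0 <= m < 1.0:
        raise ValueError("momentum coefficient m must satisfy 0 <= m < 1")
    return eta / ((1.0 - m) * B)


def lr_for_temperature(T_eff, B, m=0.0):
    """Learning rate that yields a target T_eff at batch size B and momentum m.

    Obtained by inverting the formula above (an illustration, not a
    prescription from the source).
    """
    if not 0.0 <= m < 1.0:
        raise ValueError("momentum coefficient m must satisfy 0 <= m < 1")
    return T_eff * (1.0 - m) * B
```

For $m=0.9$ the ratio `effective_temperature(eta, B, 0.9) / effective_temperature(eta, B)` is $1/(1-m)=10$ up to floating-point rounding, matching the upward shift of $w_\perp$ reported in the caption.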