On the Trajectories of SGD Without Replacement

Pierfrancesco Beneventano

TL;DR

This work analyzes stochastic gradient descent without replacement (random reshuffling), the variant most commonly used in practice to train deep neural networks. It shows that, in a regime where the product of the learning rate and the Hessian, $c=\eta k$, is not necessarily small, SGD without replacement behaves like gradient descent plus an extra drift step on a novel regularizer that penalizes the gradient covariance, biasing the trajectory toward flatter regions. The main result decouples the dynamics into descent along high-curvature directions (as in SGD with replacement) and drift-driven regularization along flat directions. This regularization reshapes the Hessian spectrum and can explain empirical observations such as faster saddle escape, reduced oscillations, and implicit sparsification of the Hessian and Fisher matrices. The work also analyzes the edge of stability, showing phase-transition-type behavior in which the drift can dominate the GD step, and connects the results to Fisher information and to prior empirical findings on generalization and batch-size effects. Overall, the implicit drift regularizer offers a principled explanation for why random reshuffling often converges faster and generalizes better in practice: it guides optimization toward flatter minima and through saddle regions more efficiently than i.i.d. sampling schemes.

Abstract

This article examines the implicit regularization effect of Stochastic Gradient Descent (SGD). We consider the case of SGD without replacement, the variant typically used to optimize large-scale neural networks. We analyze this algorithm in a more realistic regime than typically considered in theoretical works on SGD: for example, we allow the product of the learning rate and Hessian to be $O(1)$, and we do not specify any model architecture, learning task, or loss (objective) function. Our core theoretical result is that optimizing with SGD without replacement is locally equivalent to making an additional step on a novel regularizer. This implies that the expected trajectories of SGD without replacement can be decoupled into (i) following SGD with replacement (in which batches are sampled i.i.d.) along the directions of high curvature, and (ii) regularizing the trace of the noise covariance along the flat ones. As a consequence, SGD without replacement travels flat areas and may escape saddles significantly faster than SGD with replacement. On several vision tasks, the novel regularizer penalizes a weighted trace of the Fisher Matrix, thus encouraging sparsity in the spectrum of the Hessian of the loss, in line with empirical observations from prior work. We also propose an explanation for why SGD does not train at the edge of stability (as opposed to GD).
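
The two sampling schemes contrasted in the abstract can be sketched as follows. This is a minimal illustration, not code from the paper: the function names are hypothetical, the batch size `b` is assumed to divide the dataset size `n`, and "with replacement" is assumed here to mean that each batch is drawn independently of the previous ones (with distinct indices inside a batch).

```python
import random

def batches_without_replacement(n, b, rng):
    """One epoch of random reshuffling: shuffle all indices once,
    then cut the permutation into consecutive batches of size b."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i:i + b] for i in range(0, n, b)]

def batches_with_replacement(n, b, rng):
    """Same number of steps as one epoch, but every batch is sampled
    independently, so an example may appear in several batches or in none."""
    steps = n // b  # assumes b divides n
    return [rng.sample(range(n), b) for _ in range(steps)]

rng = random.Random(0)
epoch = batches_without_replacement(8, 2, rng)
iid = batches_with_replacement(8, 2, rng)
```

Under random reshuffling the batches of an epoch are statistically dependent: together they cover every example exactly once. The i.i.d. scheme gives no such guarantee, and this dependence is what the analysis exploits.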

Paper Structure (108 sections, 13 theorems, 212 equations, 6 figures, 1 table)

Introduction
The Problem and the Background
Informal Overview of the Results
Decoupling the Dynamics
Shaping the Hessian - the Implicit Regularization
Escaping Saddles
Escaping the Edge of Stability
With vs Without Replacement.
Outline of the Remainder of the Article
The Problem
Training Neural Network and the SGD
How we train neural networks and why.
Implicit Regularization to Explain Generalization
Generalization.
Implicit Regularization.
...and 93 more sections

Key Result

Theorem 1

In expectation over batch sampling, one epoch of SGD without replacement differs from the same number of steps of SGD with replacement or GD by a regularizing step of size $\eta/(b-1)$. At a stationary point, the regularizer of the $i$-th parameter is a weighted trace of the covariance of the gradients [...] where $S_i$ is a diagonal matrix which depends on the Hessian. Away from a stationary point, there [...]
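
To make the quantity in the theorem concrete, the sketch below computes the plain (unweighted) trace of the covariance of per-example gradients for a small least-squares problem. It is a simplification under stated assumptions: the Hessian-dependent weighting $S_i$ from the theorem is omitted, the covariance is normalized by $1/n$, and the function names and toy data are hypothetical.

```python
def per_example_grads(w, data):
    """Gradient of the squared loss 0.5 * (w . x - y)^2 for each example (x, y)."""
    grads = []
    for x, y in data:
        residual = sum(wi * xi for wi, xi in zip(w, x)) - y
        grads.append([residual * xi for xi in x])
    return grads

def trace_grad_covariance(grads):
    """Trace of the empirical covariance of the per-example gradients,
    i.e. the mean squared distance of each gradient from the mean gradient.
    Assumption: unweighted trace, 1/n normalization (no S_i weighting)."""
    n = len(grads)
    dim = len(grads[0])
    mean = [sum(g[j] for g in grads) / n for j in range(dim)]
    return sum(
        sum((g[j] - mean[j]) ** 2 for j in range(dim)) for g in grads
    ) / n

# Toy dataset (hypothetical): two examples in two dimensions.
data = [([1.0, 0.0], 1.0), ([0.0, 1.0], -1.0)]
noisy = trace_grad_covariance(per_example_grads([0.0, 0.0], data))
exact = trace_grad_covariance(per_example_grads([1.0, -1.0], data))
```

At `w = [1.0, -1.0]` the model fits both examples exactly, so every per-example gradient is zero and the trace vanishes. At `w = [0.0, 0.0]` the per-example gradients disagree and the trace is positive. A penalty on this quantity therefore favors regions where the per-example gradients agree.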

Figures (6)

Figure 1: Dynamics in the setting outlined in § \ref{section:setting_plots}, for small (red) and big (orange) learning rates. The colored surface is the loss landscape, the blue line is the manifold of minima, the white bullet is the initialization, and the magenta bullets are the lowest norm solutions. SGD without and with replacement both move towards the lowest norm solution, unlike noised GD. However, SGD without replacement converges faster and with better accuracy, while steadily reducing the variance. SGD with replacement approaches more slowly and oscillates around the solution with higher variance and less precision.
Figure 2: Behavior of SGD with replacement (lighter blue) vs SGD without replacement (darker blue), starting from a spurious local minimum of the loss of a ReLU network, with the same hyperparameters. SGD without replacement escapes the local minimum to which GD converged (by traveling flat areas) faster than SGD with replacement, and with much smaller oscillations. Both converge to a global minimum. See \ref{fig:W_comparison}.
Figure 3: The setting is the same as in \ref{fig:intro}, outlined in § \ref{section:setting_plots}. On the left are the trajectories of full-batch GD for small (red) and big (orange) learning rates. On the right is the trajectory of SGD without replacement with a small learning rate, run until convergence. On the left, we see that increasing the step size leads the algorithm to a solution closer to the lowest norm one. This implicit regularization effect arises from discretization, consistent with (Lewkowycz et al., 2020; Jastrzebski et al., 2021). However, no matter the learning rate, GD stops as soon as it reaches a stationary area, while SGD without replacement navigates the manifold of minima with an oscillatory trajectory, converging to the lowest norm solution.
Figure 4: Dynamics of SGD without replacement for a ReLU network on a synthetic dataset with a fixed learning rate. We first observe a convergence (mixed) phase, then a regularization phase, then a second convergence phase that ends in a global minimum. Convergence to the global minimum was impossible without the regularization phase, e.g., in the case of GD.
Figure 5: SGD without replacement escapes a local minimum to which GD converged. Specifically, we see a regularization phase while traveling a flat area, followed by a grokking effect that leads to convergence to a global minimum. We fit a "W"-shaped one-dimensional dataset with a shallow ReLU network and MSE loss. We run GD, which converges to the function shown on the right (in fewer than 400 steps). This point is a spurious local minimum. We then run SGD without replacement with the same learning rate. The orange functions above are the functions represented by the neural network at intermediate steps; the red function is the function of the neural network at convergence. Below, we can see that SGD travels the flat area, regularizing the top eigenvalues of the Hessian and the gradients of the model in both the parameter space and the input space.
...and 1 more figures

Theorems & Definitions (17)

Theorem 1: Informal corollary of \ref{theo:SGD_bias_eta}
Theorem 2
Theorem 3
Proposition 4: Traveling Flat Regions.
Proposition 5: Regularizing the trace of Hessian
Proposition 6: Escaping strict saddles.
Proposition 7: Escaping high-order saddles.
Proposition 8: Breaking point
Proposition 9
Proposition 10
...and 7 more