Anti-Correlated Noise in Epoch-Based Stochastic Gradient Descent: Implications for Weight Variances in Flat Directions

Marcel Kühn; Bernd Rosenow

TL;DR

The paper shows that epoch-based SGD induces anti-correlated gradient noise, derives the exact autocorrelation under without-replacement sampling, and demonstrates a two-regime variance structure tied to Hessian curvature. It provides closed-form expressions for weight and velocity variances and their correlation times in large- and small-curvature directions, validated via Hessian-projected analyses of LeNet and ResNet-20 on CIFAR-10. The findings indicate that anti-correlations suppress fluctuations in flat directions, biasing optimization toward flatter minima and improving generalization, with practical implications for batch sampling strategies in SGD. Overall, the work links sampling discipline to the stochastic dynamics of training and offers a framework to predict variance and diffusion in weight space across curvature regimes.

Abstract

Stochastic Gradient Descent (SGD) has become a cornerstone of neural network optimization due to its computational efficiency and generalization capabilities. However, the gradient noise introduced by SGD is often assumed to be uncorrelated over time, despite the common practice of epoch-based training where data is sampled without replacement. In this work, we challenge this assumption and investigate the effects of epoch-based noise correlations on the stationary distribution of discrete-time SGD with momentum. Our main contributions are twofold: First, we calculate the exact autocorrelation of the noise during epoch-based training under the assumption that the noise is independent of small fluctuations in the weight vector, revealing that SGD noise is inherently anti-correlated over time. Second, we explore the influence of these anti-correlations on the variance of weight fluctuations. We find that for directions with curvature of the loss greater than a hyperparameter-dependent crossover value, the conventional predictions of isotropic weight variance under stationarity, based on uncorrelated and curvature-proportional noise, are recovered. Anti-correlations have negligible effect here. However, for relatively flat directions, the weight variance is significantly reduced, leading to a considerable decrease in loss fluctuations compared to the constant weight variance assumption. Furthermore, we present a numerical experiment where training with these anti-correlations enhances test performance, suggesting that the inherent noise structure induced by epoch-based training may play a role in finding flatter minima that generalize better.
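The contrast between steep and flat directions described in the abstract can be illustrated on a toy problem. The sketch below is not the authors' experiment: it assumes a one-dimensional quadratic loss, a heavy-ball form of SGD with momentum, noise that is independent of the weight (as the abstract assumes), and hypothetical hyperparameters (`lr`, `beta`, the two curvatures, 50 batches per epoch). The uncorrelated baseline is obtained by shuffling the epoch-based noise sequence in time, which keeps its per-step distribution but destroys the temporal structure.

```python
import numpy as np

def epoch_noise(example_noise, batch_size, n_epochs, rng):
    """Batch-mean noise for epoch-based sampling (fresh shuffle every epoch)."""
    n = example_noise.size
    assert n % batch_size == 0, "N must be an integer multiple of the batch size"
    m = n // batch_size  # number of batches per epoch
    shuffled = np.stack([rng.permutation(example_noise) for _ in range(n_epochs)])
    return shuffled.reshape(n_epochs, m, batch_size).mean(axis=2).ravel()

def weight_variance(noise, curvature, lr=0.1, beta=0.5, burn_in=20_000):
    """Empirical weight variance of heavy-ball SGD on a 1-D quadratic loss."""
    theta = v = 0.0
    total = total_sq = 0.0
    count = 0
    for k, dg in enumerate(noise.tolist()):
        # Assumed update rule: velocity first, then weight.
        v = beta * v - lr * (curvature * theta + dg)
        theta += v
        if k >= burn_in:
            total += theta
            total_sq += theta * theta
            count += 1
    mean = total / count
    return total_sq / count - mean * mean

rng = np.random.default_rng(0)
example_noise = rng.standard_normal(200)   # hypothetical per-example noise
example_noise -= example_noise.mean()      # full-batch noise is zero

anti = epoch_noise(example_noise, batch_size=4, n_epochs=4000, rng=rng)
white = rng.permutation(anti)              # same values, temporal order destroyed

ratios = {}
for lam in (5.0, 0.01):                    # one steep and one flat direction
    ratios[lam] = weight_variance(anti, lam) / weight_variance(white, lam)
    print(f"curvature {lam:g}: variance ratio (epoch-based / shuffled) = {ratios[lam]:.3f}")
```

With these settings the ratio should stay close to one in the steep direction and drop well below one in the flat direction, in line with the two regimes the abstract describes. The exact values depend on the seed and the chosen hyperparameters.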

Paper Structure (34 sections, 5 theorems, 105 equations, 16 figures, 1 table)

Introduction
Background
Related Work
Hessian and gradient sample covariance
Limiting Dynamics and Weight Fluctuations
Theory
Understanding Variance Structure via Velocity Variance and Correlation Time
Variance for late training phase
Numerics
Analysis Setup
Noise Autocorrelations
Variances and Correlation time
Discussion
Definition of Limiting Quantities
Variance Calculation
...and 19 more sections

Key Result

Theorem 4.1

If the total number of examples $N$ is an integer multiple of the batch size $S$ and the parameters $\boldsymbol{\theta}$ of a network are kept fixed, then the autocorrelation of the gradient noise for an epoch-based learning schedule, in which the examples for each new epoch are drawn without replacement, can be stated exactly in terms of $M \coloneqq N/S$, the number of batches per epoch.
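The setting of the theorem (fixed parameters, $N$ a multiple of $S$, a fresh shuffle per epoch) is easy to probe numerically. The sketch below is an illustration under stated assumptions, not the paper's code: per-example gradients are hypothetical scalars drawn from a standard normal, and `epoch_noise` and `autocorrelation` are helper names introduced here. It relies on one fact that follows directly from the setup: the batch noise terms of a single epoch sum to zero, so any two distinct batches of the same epoch have correlation $-1/(M-1)$.

```python
import numpy as np

def epoch_noise(example_grads, batch_size, n_epochs, rng):
    """Gradient noise at fixed weights, shape (n_epochs, M)."""
    n = example_grads.size
    assert n % batch_size == 0, "N must be an integer multiple of S"
    m = n // batch_size
    full_grad = example_grads.mean()
    shuffled = np.stack([rng.permutation(example_grads) for _ in range(n_epochs)])
    batch_grads = shuffled.reshape(n_epochs, m, batch_size).mean(axis=2)
    return batch_grads - full_grad

def autocorrelation(series, lag):
    """Normalized autocorrelation of a zero-mean 1-D series at a positive lag."""
    return np.mean(series[:-lag] * series[lag:]) / np.mean(series**2)

rng = np.random.default_rng(1)
grads = rng.standard_normal(500)          # hypothetical scalar per-example gradients
M = 10                                    # batches per epoch (N = 500, S = 50)
noise = epoch_noise(grads, batch_size=50, n_epochs=5000, rng=rng)

# The noise of each epoch sums to zero because the batches partition the dataset.
epoch_sums = noise.sum(axis=1)

# Average product over all distinct batch pairs within an epoch, normalized.
var = np.mean(noise**2)
pair_cov = np.mean((epoch_sums**2 - (noise**2).sum(axis=1)) / (M * (M - 1)))
same_epoch_corr = pair_cov / var

lag_one = autocorrelation(noise.ravel(), 1)       # mostly same-epoch pairs
lag_far = autocorrelation(noise.ravel(), 3 * M)   # always different epochs

print("same-epoch correlation:", same_epoch_corr, "vs -1/(M-1) =", -1 / (M - 1))
print("lag 1:", lag_one, " lag 3M:", lag_far)
```

The lag-one estimate comes out negative, while lags longer than an epoch compare independently shuffled epochs and should fluctuate around zero.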

Figures (16)

Figure 1: Autocorrelations of the SGD noise observed over a span of 20 epochs, equivalent to 20,000 update steps. This data is collected from a later phase in the training process. The autocorrelation is projected onto 5,000 Hessian eigenvectors, and the result is averaged. The theoretical prediction \ref{eq:noise_corr} is also displayed along with a $2\sigma$-interval, where $\sigma$ represents the expected standard deviation of the SGD noise. The zero-point correlation is omitted as it is inherently equal to one.
Figure 2: Relationship between Hessian eigenvalues and the variances of weights and velocities, as well as correlation times. In the left panel, we present the variances of weights and velocities. The solid lines signify theoretical predictions from \ref{variances.eq}, assuming $\mathbf{C} \approx c_0 \mathbf{H}$ with $c_0$ fitted to the data. The right panel showcases the correlation time together with the theoretical prediction resulting from \ref{variances.eq}, which does not require the $\mathbf{C} \approx c_0 \mathbf{H}$ assumption.
Figure 3: Theoretical prediction (left) and numerical estimates (right) of the variance of Hessian noise terms, $\langle \delta H_{ij}^2 \rangle$. Noise variances were computed at the weight vector obtained after 300 epochs of initial training. The dataset was split into $M=1000$ batches of $S=50$ examples each, and for each batch, the deviation of Hessian elements from the full-batch Hessian was evaluated in the subspace spanned by the top 100 Hessian eigenvectors, i.e., $\delta H_{ij}^{(k)}(\boldsymbol{\theta}_K)$ for $k = K, \dots, K+M$.
Figure 4: Evolution of the loss (left) and accuracy (right) during training of the LeNet described in the main text. The statistics are shown for both the training and test sets. For the first 300 epochs, an exponential learning rate decay was used; for the last 20 epochs, the learning rate was fixed at the final value of the exponential decay.
Figure 5: Distribution of the 5,000 approximated Hessian eigenvalues of the LeNet discussed in the main text. The inset shows that the smallest approximated eigenvalue has a magnitude of about 0.2.
...and 11 more figures
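The procedure in the Figure 3 caption (per-batch deviation of the Hessian from the full-batch Hessian, evaluated in the subspace of the top eigenvectors) can be sketched on a toy model. The example below assumes a linear least-squares model, for which the per-example Hessian is the outer product of the input with itself. The sizes, the variable names, and the use of a dense `numpy.linalg.eigh` decomposition are illustrative choices, not the paper's setup, which approximates eigenvectors of a neural network Hessian.

```python
import numpy as np

rng = np.random.default_rng(2)
n_examples, dim, batch_size, top_k = 600, 8, 50, 3

# Hypothetical inputs with anisotropic scale, so the Hessian has a spread spectrum.
X = rng.standard_normal((n_examples, dim)) * np.linspace(2.0, 0.2, dim)

# For the loss 0.5 * (x @ w - y)**2 the per-example Hessian is outer(x, x).
full_hessian = X.T @ X / n_examples
eigvals, eigvecs = np.linalg.eigh(full_hessian)   # eigenvalues in ascending order
top = eigvecs[:, -top_k:]                         # columns are the top-k eigenvectors

# One epoch: a random permutation split into M disjoint batches.
batches = rng.permutation(n_examples).reshape(-1, batch_size)

# Deviation of each batch Hessian from the full Hessian, projected on the subspace.
delta_h = np.stack([
    top.T @ (X[idx].T @ X[idx] / batch_size - full_hessian) @ top
    for idx in batches
])                                                # shape (M, top_k, top_k)

# Mean square over the batches of one epoch: an estimate of <dH_ij^2>.
hessian_noise_var = np.mean(delta_h**2, axis=0)
print(hessian_noise_var)
```

Because the batches partition the dataset, the projected deviations sum to zero over the epoch, so the mean square over batches is also their variance.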

Theorems & Definitions (5)

Theorem 4.1
Theorem 4.2
Theorem 4.3
Corollary 4.4
Corollary 4.5
