Anti-Correlated Noise in Epoch-Based Stochastic Gradient Descent: Implications for Weight Variances in Flat Directions
Marcel Kühn, Bernd Rosenow
TL;DR
The paper shows that epoch-based SGD induces anti-correlated gradient noise, derives the exact noise autocorrelation under without-replacement sampling, and demonstrates a two-regime variance structure tied to Hessian curvature. It provides closed-form expressions for weight and velocity variances and their correlation times in large- and small-curvature directions, validated via Hessian-projected analyses of LeNet and ResNet-20 on CIFAR-10. The findings indicate that anti-correlations suppress weight fluctuations in flat directions, and a numerical experiment suggests that this noise structure may help training find flatter minima that generalize better, with practical implications for batch sampling strategies in SGD. Overall, the work links the sampling scheme to the stochastic dynamics of training and offers a framework to predict variance and diffusion in weight space across curvature regimes.
Abstract
Stochastic Gradient Descent (SGD) has become a cornerstone of neural network optimization due to its computational efficiency and generalization capabilities. However, the gradient noise introduced by SGD is often assumed to be uncorrelated over time, despite the common practice of epoch-based training where data is sampled without replacement. In this work, we challenge this assumption and investigate the effects of epoch-based noise correlations on the stationary distribution of discrete-time SGD with momentum. Our main contributions are twofold: First, we calculate the exact autocorrelation of the noise during epoch-based training under the assumption that the noise is independent of small fluctuations in the weight vector, revealing that SGD noise is inherently anti-correlated over time. Second, we explore the influence of these anti-correlations on the variance of weight fluctuations. We find that for directions with curvature of the loss greater than a hyperparameter-dependent crossover value, the conventional predictions of isotropic weight variance under stationarity, based on uncorrelated and curvature-proportional noise, are recovered. Anti-correlations have negligible effect here. However, for relatively flat directions, the weight variance is significantly reduced, leading to a considerable decrease in loss fluctuations compared to the constant weight variance assumption. Furthermore, we present a numerical experiment where training with these anti-correlations enhances test performance, suggesting that the inherent noise structure induced by epoch-based training may play a role in finding flatter minima that generalize better.
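The two claims above can be illustrated with a small numerical sketch. It is not the paper's derivation or experimental setup: the per-example gradients are fixed random scalars, the loss is a one-dimensional quadratic with curvature h, and plain SGD without momentum is used, all of which are assumptions made for illustration (as are the names epoch_noise, iid_noise, and stationary_variance). Because the batch noise within one epoch sums to zero, two different batches of the same epoch should have a correlation close to -1/(M-1) for M batches per epoch, and the weight variance in a flat direction should be much smaller than with batches drawn with replacement.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: N fixed scalar per-example gradients, batch size B.
N, B = 64, 8
M = N // B                      # batches per epoch
g = rng.normal(size=N)          # stand-in for per-example gradients
g_mean = g.mean()               # full-batch gradient


def epoch_noise(n_rows):
    """Batch-gradient noise for n_rows independent epochs, shape (n_rows, M).

    Each row is one epoch: the data are permuted, split into M disjoint
    batches, and the full-batch gradient is subtracted from each batch mean.
    """
    perm = rng.permuted(np.tile(g, (n_rows, 1)), axis=1)
    return perm.reshape(n_rows, M, B).mean(axis=2) - g_mean


def iid_noise(n_rows):
    """Same shape, but every batch is drawn with replacement."""
    return rng.choice(g, size=(n_rows, M, B)).mean(axis=2) - g_mean


# 1) Correlation between two different batches of the same epoch.
noise = epoch_noise(20_000)
epoch_sum = np.abs(noise.sum(axis=1)).max()          # cancels up to rounding
corr = np.corrcoef(noise[:, 0], noise[:, 1])[0, 1]   # theory: -1 / (M - 1)


# 2) Stationary weight variance in a flat direction of a 1D quadratic loss.
def stationary_variance(noise_fn, lr=0.1, h=0.05, n_epochs=250, n_chains=2000):
    w = np.zeros(n_chains)
    for _ in range(n_epochs):                # burn-in towards stationarity
        e = noise_fn(n_chains)
        for k in range(M):
            w = w - lr * (h * w + e[:, k])   # plain SGD step, no momentum
    # One more epoch: average the across-chain variance over its steps.
    e = noise_fn(n_chains)
    variances = []
    for k in range(M):
        w = w - lr * (h * w + e[:, k])
        variances.append(w.var())
    return float(np.mean(variances))


var_epoch = stationary_variance(epoch_noise)
var_iid = stationary_variance(iid_noise)
print(f"max |sum of noise over an epoch| = {epoch_sum:.2e}")
print(f"batch correlation: {corr:.3f} (theory {-1 / (M - 1):.3f})")
print(f"variance ratio, epoch-based / with replacement: {var_epoch / var_iid:.3f}")
```

With lr * h * M much smaller than 1, the noise accumulated over an epoch nearly cancels before the restoring force can act, which is why the epoch-based variance is strongly suppressed in this toy setting. Increasing h towards 1 / (lr * M) and beyond should bring the two variances closer together, in line with the crossover described above.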
