Error dynamics of mini-batch gradient descent with random reshuffling for least squares regression
Jackie Lok, Rishi Sonthalia, Elizaveta Rebrova
TL;DR
The paper analyzes the discrete dynamics of mini-batch gradient descent with random reshuffling for least squares regression and shows that both the training and generalization errors are governed by a sample cross-covariance matrix $\mathbf{Z}$ between the original features $\mathbf{X}$ and a set of modified features $\widetilde{\mathbf{X}}$. It derives exact expressions for the mean error dynamics and the generalization error in terms of $\mathbf{Z}$, establishes that under the linear scaling rule mini-batch and full-batch gradient descent agree to leading order in the step size, and shows that for finite step sizes the mini-batch iterates converge to a step-size-dependent limit, a phenomenon that gradient flow analyses cannot detect. The work further demonstrates that batching systematically shrinks the spectrum relative to that of $\mathbf{W} = \frac{1}{n} \mathbf{X}^{\mathsf{T}} \mathbf{X}$, both in the large-$n$, fixed-$p$ regime and in the proportional regime, with precise asymptotic descriptions of $\mathbf{Z}$ (a non-commutative polynomial of random matrices) that, in the Gaussian proportional setting, are obtained via free probability. Overall, the results give a rigorous account of how batching without replacement alters learning dynamics and generalization through spectral modifications, which informs batch-size and learning-rate choices for linear models and may guide extensions to more complex architectures.
Abstract
We study the discrete dynamics of mini-batch gradient descent with random reshuffling for least squares regression. We show that the training and generalization errors depend on a sample cross-covariance matrix $\mathbf{Z}$ between the original features $\mathbf{X}$ and a set of new features $\widetilde{\mathbf{X}}$ in which each feature is modified, in an averaged way, by the mini-batches that appear before it during the learning process. Using this representation, we establish that, under the linear scaling rule, the dynamics of mini-batch and full-batch gradient descent agree up to leading order with respect to the step size. However, mini-batch gradient descent with random reshuffling exhibits a subtle dependence on the step size that a gradient flow analysis cannot detect, such as converging to a limit that depends on the step size. By comparing $\mathbf{Z}$, a non-commutative polynomial of random matrices, with the sample covariance matrix of $\mathbf{X}$ asymptotically, we demonstrate that batching affects the dynamics by inducing a form of shrinkage on the spectrum.
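The leading-order agreement under the linear scaling rule can be checked numerically. The following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes the loss $\frac{1}{2n}\|\mathbf{X}\boldsymbol{\beta} - \mathbf{y}\|^2$, a zero initialization, $B$ mini-batches of equal size $b = n/B$ that are fixed once and whose order is reshuffled every epoch, and a mini-batch step size $\alpha/B$ for a full-batch step size $\alpha$. With $\mathbf{W}_k = \frac{1}{b}\mathbf{X}_k^{\mathsf{T}}\mathbf{X}_k$, one epoch multiplies the error by $\prod_k \left(\mathbf{I} - \frac{\alpha}{B}\mathbf{W}_k\right) = \mathbf{I} - \alpha\mathbf{W} + O(\alpha^2)$, so the gap to one full-batch step should scale like $\alpha^2$. The names `full_batch_gd`, `minibatch_gd_rr`, `one_epoch_gap`, and `make_data`, as well as the symbols $B$, $b$, $\alpha$, and $\mathbf{W}_k$, are illustrative and do not come from the text above.

```python
import numpy as np


def full_batch_gd(X, y, alpha, epochs):
    """Full-batch gradient descent on (1/2n)||X beta - y||^2, started at zero."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(epochs):
        beta = beta - alpha * (X.T @ (X @ beta - y)) / n
    return beta


def minibatch_gd_rr(X, y, alpha, epochs, num_batches, rng):
    """Mini-batch gradient descent with random reshuffling, started at zero.

    Assumptions (illustrative, not taken from the text): the rows are split
    once into num_batches fixed mini-batches (n should be divisible by
    num_batches), and only the order of the batches is reshuffled at the
    start of every epoch. Each step uses step size alpha / num_batches,
    i.e. the linear scaling rule relative to a full-batch step size alpha.
    """
    n, p = X.shape
    batches = np.array_split(np.arange(n), num_batches)
    step = alpha / num_batches
    beta = np.zeros(p)
    for _ in range(epochs):
        # Visit every batch exactly once per epoch, in a fresh random order.
        for k in rng.permutation(num_batches):
            idx = batches[k]
            Xb, yb = X[idx], y[idx]
            beta = beta - step * (Xb.T @ (Xb @ beta - yb)) / len(idx)
    return beta


def one_epoch_gap(X, y, alpha, num_batches, seed=0):
    """Norm of the difference between one mini-batch epoch and one full-batch step."""
    rng = np.random.default_rng(seed)
    beta_mb = minibatch_gd_rr(X, y, alpha, 1, num_batches, rng)
    beta_fb = full_batch_gd(X, y, alpha, 1)
    return float(np.linalg.norm(beta_mb - beta_fb))


def make_data(n=100, p=5, noise=0.5, seed=0):
    """Synthetic Gaussian features with a noisy linear response (hypothetical data)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    beta_star = rng.standard_normal(p)
    y = X @ beta_star + noise * rng.standard_normal(n)
    return X, y


if __name__ == "__main__":
    X, y = make_data()
    for alpha in (1e-1, 1e-2, 1e-3):
        print(alpha, one_epoch_gap(X, y, alpha, num_batches=4))
```

If the gap is second order in $\alpha$, reducing $\alpha$ by a factor of 10 should reduce the printed gap by roughly a factor of 100. The same functions can be run for many epochs at several values of `alpha` to probe how the limit of the mini-batch iterates depends on the step size; the sketch does not assert any particular outcome for that experiment.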
