Error dynamics of mini-batch gradient descent with random reshuffling for least squares regression

Jackie Lok; Rishi Sonthalia; Elizaveta Rebrova

TL;DR

The paper analyzes the discrete dynamics of mini-batch gradient descent with random reshuffling for least squares regression and shows that both training and generalization errors are governed by a cross-covariance matrix $\mathbf{Z}$ that captures the interaction between original features $\mathbf{X}$ and modified features $\widetilde{\mathbf{X}}$. It proves exact expressions for the mean error dynamics and generalization error in terms of $\mathbf{Z}$, establishes that the linear scaling rule aligns mini-batch and full-batch behavior for infinitesimal step sizes, and reveals a step-size dependent limit for finite steps, a phenomenon invisible to gradient flow analyses. The work further demonstrates a systematic spectrum shrinkage of $\mathbf{W} = \frac{1}{n} \mathbf{X}^{\mathsf{T}} \mathbf{X}$ under batching, both in the large-$n$ fixed-$p$ and proportional regimes, with precise asymptotic descriptions in terms of $\mathbf{Z}$ (and its polynomial relations) and, in the Gaussian/proportional setting, via free probability. Overall, the results provide a rigorous lens on how batching without replacement alters learning dynamics and generalization through spectral modifications, offering insights into batch-size and learning-rate choices for linear models and guiding extensions to more complex architectures.

Abstract

We study the discrete dynamics of mini-batch gradient descent with random reshuffling for least squares regression. We show that the training and generalization errors depend on a sample cross-covariance matrix $\mathbf{Z}$ between the original features $\mathbf{X}$ and a set of new features $\widetilde{\mathbf{X}}$ in which each feature is modified, in an averaged way, by the mini-batches that appear before it during the learning process. Using this representation, we establish that, under the linear scaling rule, the dynamics of mini-batch and full-batch gradient descent agree up to leading order with respect to the step size. However, mini-batch gradient descent with random reshuffling exhibits a subtle dependence on the step size that a gradient flow analysis cannot detect, such as converging to a limit that depends on the step size. By comparing $\mathbf{Z}$, a non-commutative polynomial of random matrices, with the sample covariance matrix of $\mathbf{X}$ asymptotically, we demonstrate that batching affects the dynamics by inducing a form of shrinkage on the spectrum.
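As a rough illustration of the leading-order agreement under the linear scaling rule, the sketch below takes one full-batch step with step size $\alpha$ and one reshuffled epoch of $B$ mini-batch steps with step size $\alpha / B$ on the same data, then reports the largest coordinate-wise gap between the two iterates. It is a minimal sketch under stated assumptions, not the authors' code: the loss normalisation (squared error averaged over the batch), the equal batch sizes, the zero initialisation, the small problem size, and every function name are choices made here for illustration. Only the Python standard library is used.

```python
import random


def grad(beta, rows, targets):
    """Gradient of (1 / (2m)) * sum_i (x_i . beta - y_i)^2 over the m given rows."""
    m = len(rows)
    g = [0.0] * len(beta)
    for x, y in zip(rows, targets):
        residual = sum(xj * bj for xj, bj in zip(x, beta)) - y
        for j, xj in enumerate(x):
            g[j] += residual * xj / m
    return g


def full_batch_step(beta, X, y, alpha):
    """One gradient descent step on all rows with step size alpha."""
    g = grad(beta, X, y)
    return [b - alpha * gj for b, gj in zip(beta, g)]


def reshuffled_epoch(beta, X, y, alpha, B, rng):
    """One epoch of random reshuffling: permute the rows, split them into B
    equal batches, and take one step of size alpha / B on each batch.
    Assumes B divides the number of rows; leftover rows would be dropped."""
    idx = list(range(len(X)))
    rng.shuffle(idx)
    size = len(X) // B
    for k in range(B):
        batch = idx[k * size:(k + 1) * size]
        g = grad(beta, [X[i] for i in batch], [y[i] for i in batch])
        beta = [b - (alpha / B) * gj for b, gj in zip(beta, g)]
    return beta


def epoch_gap(alpha, B=2, n=8, p=3, seed=0):
    """Largest coordinate gap between one full-batch step and one reshuffled
    epoch, both started from zero on the same synthetic Gaussian data."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [rng.gauss(0, 1) for _ in range(n)]
    beta0 = [0.0] * p
    full = full_batch_step(beta0, X, y, alpha)
    mini = reshuffled_epoch(beta0, X, y, alpha, B, rng)
    return max(abs(a - b) for a, b in zip(full, mini))


if __name__ == "__main__":
    for alpha in (0.1, 0.01):
        print(alpha, epoch_gap(alpha))
```

In this particular setup (two equal batches, zero start) the first-order terms cancel exactly, so the gap is quadratic in $\alpha$: reducing $\alpha$ by a factor of 10 reduces the gap by a factor of 100. The sketch compares a single realised ordering only; it does not compute the averaged matrix $\mathbf{Z}$ from the paper.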

Paper Structure (41 sections, 13 theorems, 85 equations, 3 figures)

This paper contains 41 sections, 13 theorems, 85 equations, 3 figures.

Introduction
Contributions.
Related works
Gradient flow.
SGD: Sampling with replacement.
SGD: Sampling without replacement.
Linear scaling rule.
Linear models.
Model
Outline.
Analysis of mini-batch gradient descent with random reshuffling
Training error dynamics
Comparison with full-batch and mini-batching with replacement
Comparing the limiting vectors.
Comparing the trajectories.
...and 26 more sections

Key Result

Lemma 3.1

Let $\widetilde{\mathbf{X}}$ and $\mathbf{Z}$ be defined as in eq:modified_batches and eq:batch_Z. Then $\mathbf{Z}$ is a symmetric matrix, and hence all of its eigenvalues are real. Furthermore, $\mathrm{Range}(\mathbf{Z}) \subseteq \mathrm{Range}(\widetilde{\mathbf{X}}^{\mathsf{T}})$

Figures (3)

Figure 1: Limiting spectral distributions (lines) of $\alpha \mathbf{W}$ (full-batch) and $\alpha \mathbf{Z}(\alpha / 2)$ (two-batch) compared with empirical distribution of a single $n \times p$ standard Gaussian matrix (histogram).
Figure 2: Empirical generalization error dynamics with standard Gaussian data $\mathbf{X} \in \mathbb{R}^{1{,}000 \times 1{,}500}$ ($\gamma = 3/2$), $\sigma = 0.5$, and $\boldsymbol{\beta}_*$ sampled uniformly at random from the unit sphere. Gradient descent with step size $\alpha = 0.2$ compared to $B$-batch gradient descent with step size $\alpha / B$ for $B = 2, 4$. The test error is averaged over $1{,}000$ simulations with $1{,}000$ test samples in each.
Figure 3: Empirical generalization error dynamics with standard Gaussian data $\mathbf{X} \in \mathbb{R}^{4{,}000 \times 1{,}000}$ ($\gamma = 1/4$), $\sigma = 1$, and $\boldsymbol{\beta}_*$ sampled uniformly at random from the unit sphere. Gradient descent with step size $\alpha = 0.4$ compared to $B$-batch gradient descent with step size $\alpha / B$ for $B = 2, 4$. The test error is averaged over $1{,}000$ simulations with $1{,}000$ test samples in each.
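
The setup described in the captions of Figures 2 and 3 can be imitated at a much smaller scale. The sketch below is a hedged, scaled-down illustration rather than a reproduction: it keeps the ratio $\gamma = p/n = 3/2$, $\sigma = 0.5$ and $\alpha = 0.2$ from Figure 2, but the sizes ($n = 40$, $p = 60$, 200 test samples), the single run instead of an average over simulations, the zero initialisation, the batch-averaged gradient, the use of mean squared prediction error on noisy test labels as the "test error", and all function names are assumptions made here.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def unit_sphere(p, rng):
    """Uniform draw from the unit sphere: normalise a standard Gaussian vector."""
    v = [rng.gauss(0, 1) for _ in range(p)]
    norm = math.sqrt(dot(v, v))
    return [t / norm for t in v]


def make_data(n, p, beta_star, sigma, rng):
    """Standard Gaussian features with labels y = x . beta_star + sigma * noise."""
    X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    y = [dot(x, beta_star) + sigma * rng.gauss(0, 1) for x in X]
    return X, y


def mse(beta, X, y):
    """Mean squared prediction error of beta on the rows of (X, y)."""
    return sum((dot(x, beta) - t) ** 2 for x, t in zip(X, y)) / len(X)


def gd_step(beta, X, y, idx, step):
    """One gradient step on the rows in idx, with the gradient averaged over them."""
    g = [0.0] * len(beta)
    for i in idx:
        residual = dot(X[i], beta) - y[i]
        for j, xj in enumerate(X[i]):
            g[j] += residual * xj / len(idx)
    return [b - step * gj for b, gj in zip(beta, g)]


def heldout_error_curve(B, alpha=0.2, n=40, p=60, sigma=0.5,
                        epochs=20, n_test=200, seed=0):
    """Held-out error after each epoch of B-batch gradient descent with step
    size alpha / B. B = 1 is full-batch gradient descent; for B > 1 the rows
    are reshuffled at the start of every epoch. Assumes B divides n."""
    rng = random.Random(seed)
    beta_star = unit_sphere(p, rng)
    X, y = make_data(n, p, beta_star, sigma, rng)
    X_test, y_test = make_data(n_test, p, beta_star, sigma, rng)
    beta = [0.0] * p
    curve = [mse(beta, X_test, y_test)]
    size = n // B
    for _ in range(epochs):
        idx = list(range(n))
        if B > 1:
            rng.shuffle(idx)
        for k in range(B):
            beta = gd_step(beta, X, y, idx[k * size:(k + 1) * size], alpha / B)
        curve.append(mse(beta, X_test, y_test))
    return curve


if __name__ == "__main__":
    for B in (1, 2, 4):
        curve = heldout_error_curve(B)
        print(B, [round(v, 3) for v in curve[::5]])
```

Because the seed fixes the data before any shuffling takes place, the three runs share the same training and test sets, so differences between the curves come only from batching. A faithful comparison with the figures would need the original sizes and averaging over many simulations.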

Theorems & Definitions (24)

Lemma 3.1
Example 3.2: Two-batch gradient descent
Theorem 3.3
Remark 3.4: Sampling with replacement
Corollary 3.5: Limit with random reshuffling
Remark 3.6: Linear scaling and gradient flow
Remark 3.7: Large step sizes
Proposition 3.8
Theorem 3.9
Corollary 3.10
...and 14 more
