Error dynamics of mini-batch gradient descent with random reshuffling for least squares regression

Jackie Lok, Rishi Sonthalia, Elizaveta Rebrova

TL;DR

The paper analyzes the discrete dynamics of mini-batch gradient descent with random reshuffling for least squares regression and shows that both training and generalization errors are governed by a cross-covariance matrix $\mathbf{Z}$ that captures the interaction between original features $\mathbf{X}$ and modified features $\widetilde{\mathbf{X}}$. It proves exact expressions for the mean error dynamics and generalization error in terms of $\mathbf{Z}$, establishes that the linear scaling rule aligns mini-batch and full-batch behavior for infinitesimal step sizes, and reveals a step-size dependent limit for finite steps, a phenomenon invisible to gradient flow analyses. The work further demonstrates a systematic spectrum shrinkage of $\mathbf{W} = \frac{1}{n} \mathbf{X}^{\mathsf{T}} \mathbf{X}$ under batching, both in the large-$n$ fixed-$p$ and proportional regimes, with precise asymptotic descriptions in terms of $\mathbf{Z}$ (and its polynomial relations) and, in the Gaussian/proportional setting, via free probability. Overall, the results provide a rigorous lens on how batching without replacement alters learning dynamics and generalization through spectral modifications, offering insights into batch-size and learning-rate choices for linear models and guiding extensions to more complex architectures.

Abstract

We study the discrete dynamics of mini-batch gradient descent with random reshuffling for least squares regression. We show that the training and generalization errors depend on a sample cross-covariance matrix $Z$ between the original features $X$ and a set of new features $\widetilde{X}$ in which each feature is modified by the mini-batches that appear before it during the learning process in an averaged way. Using this representation, we establish that the dynamics of mini-batch and full-batch gradient descent agree up to leading order with respect to the step size using the linear scaling rule. However, mini-batch gradient descent with random reshuffling exhibits a subtle dependence on the step size that a gradient flow analysis cannot detect, such as converging to a limit that depends on the step size. By comparing $Z$, a non-commutative polynomial of random matrices, with the sample covariance matrix of $X$ asymptotically, we demonstrate that batching affects the dynamics by resulting in a form of shrinkage on the spectrum.
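The leading-order agreement described in the abstract can be checked numerically. The sketch below is a minimal illustration under stated assumptions, not the authors' code: it assumes each mini-batch step uses the mean squared-error gradient over that batch with step size $\alpha / B$ (the linear scaling rule), and it compares one epoch of $B$-batch gradient descent with random reshuffling against one full-batch step of size $\alpha$. The function names, problem sizes, and seed are hypothetical choices. If the two agree to first order in $\alpha$, the gap between them should shrink roughly fourfold each time $\alpha$ is halved.

```python
import numpy as np


def full_batch_step(beta, X, y, alpha):
    """One full-batch gradient step on (1 / 2n) * ||X beta - y||^2."""
    n = X.shape[0]
    return beta - (alpha / n) * X.T @ (X @ beta - y)


def minibatch_epoch(beta, X, y, alpha, B, rng):
    """One epoch of B-batch gradient descent with random reshuffling.

    Assumption (not taken from the paper): each step uses the mean gradient
    over its batch with step size alpha / B, i.e. the linear scaling rule.
    """
    n = X.shape[0]
    perm = rng.permutation(n)  # sample a fresh order without replacement
    for idx in np.array_split(perm, B):
        Xb, yb = X[idx], y[idx]
        grad = Xb.T @ (Xb @ beta - yb) / len(idx)
        beta = beta - (alpha / B) * grad
    return beta


def leading_order_gaps(alphas=(0.1, 0.05, 0.025), n=200, p=20, B=4, seed=0):
    """Distance between one mini-batch epoch and one full-batch step."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    beta_star = rng.standard_normal(p)
    beta_star /= np.linalg.norm(beta_star)
    y = X @ beta_star + 0.5 * rng.standard_normal(n)
    beta0 = np.zeros(p)

    gaps = []
    for alpha in alphas:
        # Re-seed so that every step size sees the same batch order.
        perm_rng = np.random.default_rng(seed + 1)
        mb = minibatch_epoch(beta0, X, y, alpha, B, perm_rng)
        fb = full_batch_step(beta0, X, y, alpha)
        gaps.append(float(np.linalg.norm(mb - fb)))
    return gaps


if __name__ == "__main__":
    for alpha, gap in zip((0.1, 0.05, 0.025), leading_order_gaps()):
        print(f"alpha = {alpha:<6} gap = {gap:.3e}")
```

The gap is second order in $\alpha$ because the first-order terms of the $B$ mini-batch updates sum to exactly the full-batch gradient step. What remains comes from each later batch seeing an iterate that earlier batches have already moved.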

Paper Structure

This paper contains 41 sections, 13 theorems, 85 equations, 3 figures.

Key Result

Lemma 3.1

Let $\widetilde{\mathbf{X}}$ and $\mathbf{Z}$ be defined as in the paper (equation labels eq:modified_batches and eq:batch_Z). Then $\mathbf{Z}$ is a symmetric matrix, and hence all of its eigenvalues are real. Furthermore, $\mathrm{Range}(\mathbf{Z}) \subseteq \mathrm{Range}(\widetilde{\mathbf{X}}^{\mathsf{T}})$.

Figures (3)

  • Figure 1: Limiting spectral distributions (lines) of $\alpha \mathbf{W}$ (full-batch) and $\alpha \mathbf{Z}(\alpha / 2)$ (two-batch), compared with the empirical distribution for a single $n \times p$ standard Gaussian matrix (histogram).
  • Figure 2: Empirical generalization error dynamics with standard Gaussian data $\mathbf{X} \in \mathbb{R}^{1{,}000 \times 1{,}500}$ ($\gamma = 3/2$), $\sigma = 0.5$, and $\boldsymbol{\beta}_*$ sampled uniformly at random from the unit sphere. Gradient descent with step size $\alpha = 0.2$ compared to $B$-batch gradient descent with step size $\alpha / B$ for $B = 2, 4$. The test error is averaged over $1{,}000$ simulations with $1{,}000$ test samples in each.
  • Figure 3: Empirical generalization error dynamics with standard Gaussian data $\mathbf{X} \in \mathbb{R}^{4{,}000 \times 1{,}000}$ ($\gamma = 1/4$), $\sigma = 1$, and $\boldsymbol{\beta}_*$ sampled uniformly at random from the unit sphere. Gradient descent with step size $\alpha = 0.4$ compared to $B$-batch gradient descent with step size $\alpha / B$ for $B = 2, 4$. The test error is averaged over $1{,}000$ simulations with $1{,}000$ test samples in each.
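
The setup in the captions of Figures 2 and 3 can be sketched at a smaller scale. The code below is a hedged illustration, not the authors' experiment. It makes several assumptions: a single run rather than an average over $1{,}000$ simulations; hypothetical default sizes $n = 200$, $p = 300$ (keeping $\gamma = p/n = 3/2$ as in Figure 2); mean-gradient mini-batch steps with step size $\alpha / B$; and the exact population risk $\lVert \boldsymbol{\beta} - \boldsymbol{\beta}_* \rVert^2 + \sigma^2$, which holds for standard Gaussian test features, in place of an empirical test set. The case $B = 1$ reduces to full-batch gradient descent.

```python
import numpy as np


def risk_curve(X, y, beta_star, sigma, alpha, B, epochs, rng):
    """Population risk after each epoch of B-batch GD with random reshuffling.

    Assumptions (not taken from the paper): zero initialization, mean
    gradient over each batch, and step size alpha / B per mini-batch step.
    """
    n, p = X.shape
    beta = np.zeros(p)
    # For x ~ N(0, I): E[(x^T beta - y)^2] = ||beta - beta_star||^2 + sigma^2.
    risks = [np.sum((beta - beta_star) ** 2) + sigma**2]
    for _ in range(epochs):
        for idx in np.array_split(rng.permutation(n), B):
            grad = X[idx].T @ (X[idx] @ beta - y[idx]) / len(idx)
            beta = beta - (alpha / B) * grad
        risks.append(np.sum((beta - beta_star) ** 2) + sigma**2)
    return np.array(risks)


def run_experiment(n=200, p=300, sigma=0.5, alpha=0.2, epochs=30, seed=0):
    """Risk curves for B = 1 (full batch), 2, and 4 on one Gaussian dataset."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    # A normalized Gaussian vector is uniform on the unit sphere.
    beta_star = rng.standard_normal(p)
    beta_star /= np.linalg.norm(beta_star)
    y = X @ beta_star + sigma * rng.standard_normal(n)
    return {
        B: risk_curve(X, y, beta_star, sigma, alpha, B, epochs, rng)
        for B in (1, 2, 4)
    }


if __name__ == "__main__":
    for B, curve in run_experiment().items():
        print(f"B = {B}: initial risk {curve[0]:.3f}, final risk {curve[-1]:.3f}")
```

Averaging the curves over many seeds, and over the reshuffling order, would bring this closer to the averaged quantities shown in the figures.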

Theorems & Definitions (24)

  • Lemma 3.1
  • Example 3.2: Two-batch gradient descent
  • Theorem 3.3
  • Remark 3.4: Sampling with replacement
  • Corollary 3.5: Limit with random reshuffling
  • Remark 3.6: Linear scaling and gradient flow
  • Remark 3.7: Large step sizes
  • Proposition 3.8
  • Theorem 3.9
  • Corollary 3.10
  • ...and 14 more