Beyond Implicit Bias: The Insignificance of SGD Noise in Online Learning

Nikhil Vyas, Depen Morwani, Rosie Zhao, Gal Kaplun, Sham Kakade, Boaz Barak

TL;DR

This work questions the conventional view that SGD noise provides implicit bias benefits in deep learning by distinguishing online (one-epoch) learning from offline training. It combines large-scale empirical studies on vision (CIFAR-5m, ImageNet) and language (C4) with theoretical results for convex online optimization, introducing the Golden Path hypothesis: SGD traces a noisy version of the noiseless gradient descent path but ultimately follows a shared trajectory. Empirically, lowering SGD noise does not hurt in online settings and often improves performance per gradient step, while the loss and function-space analyses show trajectories and predictions converge toward the GD path, as quantified by total variation distances. These findings imply that batch size primarily affects computational cost and stability in online learning, rather than inducing beneficial implicit bias, and invite a gradient-descent–driven theoretical lens for online deep learning.

Abstract

The success of SGD in deep learning has been ascribed by prior works to the implicit bias induced by finite batch sizes ("SGD noise"). While prior works focused on offline learning (i.e., multiple-epoch training), we study the impact of SGD noise on online (i.e., single epoch) learning. Through an extensive empirical analysis of image and language data, we demonstrate that small batch sizes do not confer any implicit bias advantages in online learning. In contrast to offline learning, the benefits of SGD noise in online learning are strictly computational, facilitating more cost-effective gradient steps. This suggests that SGD in the online regime can be construed as taking noisy steps along the "golden path" of the noiseless gradient descent algorithm. We study this hypothesis and provide supporting evidence in loss and function space. Our findings challenge the prevailing understanding of SGD and offer novel insights into its role in online learning.

Paper Structure

This paper contains 23 sections, 1 theorem, 6 equations, 15 figures.

Key Result

Theorem 1

Consider the quadratic loss function given by $\mathcal{L}(w) = w^\top Hw$, where $H$ is a positive semi-definite matrix. With stochastic gradients $g(w)$ modeled as additive Gaussian noise, i.e., $g(w) = \nabla \mathcal{L}(w) + \xi$, where $\xi \sim \mathcal{N}(0, \sigma^2 I)$, and for a fixed lea
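
The setup in the theorem statement can be simulated directly. The sketch below is illustrative only and does not reproduce the theorem's conclusion, which is cut off above. It assumes a diagonal $H$, a constant step size `lr`, and specific values for $\sigma$ and the number of steps; all of these, along with the function names, are our own choices. Because the gradient of a quadratic is linear in $w$ and the noise has zero mean, the average of many noisy runs should match the noiseless gradient descent iterate.

```python
import random

def noisy_gd_path(h_diag, w0, lr, sigma, steps, rng):
    """Iterate w <- w - lr * (2*H*w + xi) for diagonal H, xi ~ N(0, sigma^2 I).

    The gradient of L(w) = w^T H w is 2*H*w when H is symmetric.
    sigma = 0 gives plain (noiseless) gradient descent.
    """
    w = list(w0)
    path = [list(w)]
    for _ in range(steps):
        w = [wi - lr * (2.0 * hi * wi + rng.gauss(0.0, sigma))
             for wi, hi in zip(w, h_diag)]
        path.append(list(w))
    return path

def loss(h_diag, w):
    """Quadratic loss L(w) = w^T H w for diagonal H."""
    return sum(hi * wi * wi for hi, wi in zip(h_diag, w))

# Illustrative (assumed) problem instance.
h_diag = [1.0, 0.1]
w0 = [1.0, 1.0]
lr, steps, sigma = 0.1, 50, 0.5

# Noiseless gradient descent path.
gd_path = noisy_gd_path(h_diag, w0, lr, 0.0, steps, random.Random(0))

# Average the final iterate of many independent noisy runs.
n_runs = 2000
mean_final = [0.0 for _ in w0]
for seed in range(n_runs):
    final = noisy_gd_path(h_diag, w0, lr, sigma, steps, random.Random(seed))[-1]
    mean_final = [m + f / n_runs for m, f in zip(mean_final, final)]

print("GD final iterate:        ", gd_path[-1])
print("mean noisy final iterate:", mean_final)
```

Individual noisy runs fluctuate around the gradient descent path; only their average coincides with it in this linear setting. Lowering `sigma` plays the role of increasing the batch size.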

Figures (15)

  • Figure 1: Experiment on offline (left) and online (right) learning on the C4 dataset across various batch sizes. As shown in prior works, in offline learning (left), higher SGD noise (lower batch size) offers an implicit bias advantage and plateaus at a lower loss. In contrast, we show that in online learning (right), higher SGD noise does not provide any implicit bias benefit to performance, and lower noise reaches a smaller loss. The y-axis measures early stopping (true) loss. See Section \ref{sec:2} for more details.
  • Figure 2: A priori, online learning can exhibit two potential scenarios: (a) "Fork in the road," wherein the selection of batch size leads the optimization algorithm to explore distinct regions of the search space, potentially resulting in different loss outcomes (a1: better loss for the high-noise path, which is the common case for offline learning; a2: better loss for the low-noise path). (b) "Golden path," wherein the optimization trajectory remains similar for both gradient descent and SGD. In the latter scenario, the noise in SGD primarily influences the algorithm's traversal speed (and stability) along the path. Our research provides evidence supporting the "golden path" scenario for online learning.
  • Figure 3: Test performance for ResNet-18 trained on CIFAR-5m (left), ConvNext-T on ImageNet (middle), and GPT-2-small on C4 (right) across varying batch sizes. Red corresponds to high SGD noise (small batch size), blue to low SGD noise (large batch size), and purple to an intermediate setting. Solid (resp. dotted) lines correspond to runs in the online (resp. offline) setting. For online learning, lower SGD noise runs consistently outperform higher noise runs at any given step. Offline learning performance initially matches online performance, but eventually runs with higher noise outperform low-noise runs. All experiments are averaged over $\geq 4$ runs. See Figures \ref{fig:c4_multiple_seeds} and \ref{fig:cifar_multiple_seeds} for error bars and more hyperparameter values.
  • Figure 4: Changing SGD noise (left: increasing batch size, right: decreasing batch size) during training for ResNet-18 on CIFAR-5m (bottom) and GPT-2-small on C4 (top). The red curves correspond to models trained with high SGD noise from initialization, and the blue curves to models trained with low SGD noise from initialization. In the left plot the batch size is increased after $T_0$ steps, while in the right plot the batch size is decreased after $T_0$ steps. Across both experiments, changing the batch size causes the original curve to follow a translated version (dashed) of the new batch size curve (except for the increasing batch size experiment on CIFAR-5m, where no translation is required).
  • Figure 5: This figure illustrates the potential total variation distance in function space for two scenarios: "fork in the road" (left) and "golden path" (right). In the "fork in the road" scenario, training runs with low and high batch sizes explore different regions of the function space, leading to a consistently high total variation distance, even when the batch size is increased. In the "golden path" scenario, the low batch size run follows a noisy version of the high batch size trajectory. Increasing the batch size causes the trajectory to align with the high batch size path, resulting in a total variation distance similar to two independent high batch size trajectories.
  • ...and 10 more figures
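
The total variation distance in function space mentioned in the Figure 5 caption can be sketched as follows. This is a minimal illustration using the standard definition $\mathrm{TV}(p, q) = \tfrac{1}{2}\sum_i |p_i - q_i|$ between two predictive distributions, averaged over a set of inputs. The averaging scheme, the function names, and the toy probabilities are assumptions for illustration, not the paper's exact evaluation protocol.

```python
def tv_distance(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def mean_tv(preds_a, preds_b):
    """Average TV distance between two models' predictions over shared inputs.

    preds_a[i] and preds_b[i] are the class-probability vectors that
    model A and model B assign to the i-th input.
    """
    assert len(preds_a) == len(preds_b)
    total = sum(tv_distance(p, q) for p, q in zip(preds_a, preds_b))
    return total / len(preds_a)

# Toy (made-up) predictions from two hypothetical models on two inputs.
preds_a = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
preds_b = [[0.6, 0.3, 0.1], [0.1, 0.6, 0.3]]

print(mean_tv(preds_a, preds_b))
```

Following the caption, the comparison of interest is between this distance for a run whose batch size was increased versus a high batch size run, and the same distance for two independent high batch size runs: similar values are what the "golden path" scenario predicts.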

Theorems & Definitions (2)

  • Theorem
  • proof