Does SGD really happen in tiny subspaces?

Minhak Song; Kwangjun Ahn; Chulhee Yun

TL;DR

This work challenges the notion that neural network optimization primarily unfolds in a tiny dominant Hessian subspace. By projecting SGD updates onto the top-$k$ dominant subspace and onto its orthogonal bulk subspace, the authors show that learning does not progress when confined to the dominant directions, while bulk-subspace updates drive progress comparable to that of standard SGD. The spurious gradient alignment with the dominant subspace is traced to stochastic noise in SGD (and to self-stabilization in the Edge of Stability regime). The phenomenon persists under SAM and across momentum and Adam variants, which instead accelerate learning by amplifying bulk-direction updates. The findings suggest that practical optimization should focus on bulk-subspace dynamics rather than relying on dominant-subspace guidance, with potential implications for more efficient training strategies and a deeper understanding of optimization in high-dimensional loss landscapes.

Abstract

Understanding the training dynamics of deep neural networks is challenging due to their high-dimensional nature and intricate loss landscapes. Recent studies have revealed that, along the training trajectory, the gradient approximately aligns with a low-rank top eigenspace of the training loss Hessian, referred to as the dominant subspace. Given this alignment, this paper explores whether neural networks can be trained within the dominant subspace, which, if feasible, could lead to more efficient training methods. Our primary observation is that when the SGD update is projected onto the dominant subspace, the training loss does not decrease further. This suggests that the observed alignment between the gradient and the dominant subspace is spurious. Surprisingly, projecting out the dominant subspace proves to be just as effective as the original update, despite removing the majority of the original update component. We observe similar behavior across practical setups, including the large learning rate regime (also known as Edge of Stability), Sharpness-Aware Minimization, momentum, and adaptive optimizers. We discuss the main causes and implications of this spurious alignment, shedding light on the dynamics of neural network training.
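The two projected updates described in the abstract can be sketched on a toy problem. The following is a minimal illustration, not the authors' implementation: it assumes a quadratic loss with a diagonal Hessian, so the top-$k$ eigenvectors are simply the $k$ coordinates with the largest curvature and projection reduces to masking. The names `project`, `alignment`, and `run` and all hyperparameters are hypothetical, and the sketch only shows the mechanics of the projections; it is not meant to reproduce the paper's neural network results.

```python
import math
import random


def project(update, dominant_idx, keep):
    """Keep only the dominant (keep="dominant") or bulk (keep="bulk") coordinates."""
    dom = set(dominant_idx)
    return [u if ((i in dom) == (keep == "dominant")) else 0.0
            for i, u in enumerate(update)]


def alignment(vec, dominant_idx):
    """Fraction of the norm of vec that lies in the dominant coordinates."""
    total = math.sqrt(sum(v * v for v in vec))
    part = math.sqrt(sum(vec[i] ** 2 for i in dominant_idx))
    return part / total if total > 0 else 0.0


def run(mode, hessian_diag, theta0, k, lr=0.01, noise=0.1, steps=200, seed=0):
    """Noisy gradient descent on L(theta) = 0.5 * sum_i h_i * theta_i**2.

    mode: "sgd" (plain update), "dom" (dominant part only), "bulk" (bulk part only).
    Assumption: Gaussian noise added to each gradient coordinate stands in for
    minibatch noise.
    """
    rng = random.Random(seed)
    # For a diagonal Hessian, the top-k eigen-directions are the k coordinates
    # with the largest curvature.
    order = sorted(range(len(hessian_diag)), key=lambda i: -hessian_diag[i])
    dominant_idx = order[:k]
    theta = list(theta0)
    for _ in range(steps):
        grad = [h * t + rng.gauss(0.0, noise) for h, t in zip(hessian_diag, theta)]
        if mode == "dom":
            step = project(grad, dominant_idx, "dominant")
        elif mode == "bulk":
            step = project(grad, dominant_idx, "bulk")
        else:
            step = grad
        theta = [t - lr * s for t, s in zip(theta, step)]
    return theta


hessian_diag = [50.0, 0.5, 0.5, 0.5]   # one sharp direction, three flat ones
theta0 = [1.0, 1.0, 1.0, 1.0]
theta_dom = run("dom", hessian_diag, theta0, k=1)
theta_bulk = run("bulk", hessian_diag, theta0, k=1)
```

In this sketch the dominant-only run never moves the three flat coordinates, and the bulk-only run never moves the sharp one, which is the sense in which each variant is confined to its subspace. In a real network the Hessian is not diagonal and changes along the trajectory, so the top-$k$ eigenvectors would have to be recomputed during training.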

Paper Structure (42 sections, 20 equations, 47 figures, 3 tables)

Introduction
Summary of main results
Starting point: gradient aligns with the dominant subspace
Neural networks cannot be trained within dominant subspaces
What do we expect based on quadratic Taylor approximation?
The "spurious" alignment with the dominant subspace
Bulk subspace is where the learning happens
What causes the spurious alignment with dominant subspaces?
Stochastic noise of SGD is the main cause
Understanding the role of stochastic noise via a toy quadratic model
Revisiting our preliminary analysis (\ref{sec:prelim_analysis})
Edge of Stability and Sharpness-Aware Minimization
Edge of Stability
Sharpness-Aware Minimization
Momentum and adaptive methods amplify updates in bulk subspaces
...and 27 more sections

Figures (47)

Figure 1: Summary of our main results in \ref{sec:training_dominant} (training loss in log scale). For neural network training, Gur-Ari et al. (2018) observe that gradients approximately align with the dominant subspace, spanned by the dominant eigenvectors of the training loss Hessian. To see whether this phenomenon lets us train neural networks within the dominant subspace, we implement \ref{dom-sgd}, where each SGD update is projected onto the dominant subspace. Surprisingly, training stops after this modification, suggesting that the dominant subspace is not where the learning happens. In contrast, \ref{bulk-sgd}, where we project each SGD update onto the bulk subspace orthogonal to the dominant subspace, is just as effective as the original update, despite removing the majority of each original update. Experimental details are provided in \ref{appendix:exp_detail}.
Figure 1: Mean effective learning rates over the first 1000 steps (numbers in parentheses show standard deviation). Training Transformer on SST2-1k using GD and Adam with (+m) and without (-m) momentum. GD uses a learning rate of $0.01$, and Adam uses a learning rate of $0.001$. Momentum is set to $\beta = 0.9$.
Figure 2: Low-rank structure of the Hessian. The plot shows the top eigenvalues of the loss Hessian during SGD training. The blue curves represent the top-$k$ eigenvalues, which are significantly larger than the next $k$ eigenvalues, shown in orange. Here, $k$ corresponds to the number of classes in the classification task ($k=10$ for MNIST-5k and CIFAR10-5k, and $k=2$ for SST2-1k).
Figure 2: Mean effective learning rates over the first 1000 steps (numbers in parentheses show standard deviation). Training MLP on MNIST-5k using GD and Adam with (+m) and without (-m) momentum. GD uses a learning rate of $0.01$, and Adam uses a learning rate of $0.001$. Momentum is set to $\beta = 0.9$.
Figure 3: Alignment of gradients with dominant subspaces. The plot illustrates $\chi_{k}(\nabla L (\theta_t))$ during SGD training, where $k$ is the number of classes for the classification task (see \ref{def:dom_proj}). The orange dashed lines represent the exponential moving average (EMA) of $\chi_{k}(\nabla L (\theta_t))$. After a few early steps, $\chi_{k}(\nabla L (\theta_t))$ reaches and stays near $1$, indicating the alignment between gradients and dominant subspaces.
...and 42 more figures

Theorems & Definitions (6)

Definition 1: Dominant subspace
Definition 2: Dominant subspace projection
Remark 1: This is not the end-of-training phenomenon
Definition 3
Remark 2
Definition 4: Effective learning rate
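
The definitions listed above are not reproduced on this page. As a hedged reconstruction from the abstract and figure captions only, and not a quotation of the paper's statements, the dominant subspace projection and the alignment measure $\chi_k$ can plausibly be written as follows. The symbols $u_i$, $P_k$, $P_k^{\perp}$, $g_t$, and $\eta$ are assumed notation: $u_1, \dots, u_k$ are orthonormal eigenvectors for the $k$ largest eigenvalues of the training loss Hessian $\nabla^2 L(\theta)$, $g_t$ is the stochastic gradient at step $t$, and $\eta$ is the learning rate.

$$P_k(\theta) = \sum_{i=1}^{k} u_i u_i^{\top}, \qquad P_k^{\perp}(\theta) = I - P_k(\theta), \qquad \chi_k(v) = \frac{\lVert P_k(\theta)\, v \rVert_2}{\lVert v \rVert_2}$$

$$\text{dominant-projected update:}\quad \theta_{t+1} = \theta_t - \eta\, P_k(\theta_t)\, g_t, \qquad \text{bulk-projected update:}\quad \theta_{t+1} = \theta_t - \eta\, P_k^{\perp}(\theta_t)\, g_t$$

Under this reading, $\chi_k(v) \in [0, 1]$, and a value near $1$ means that almost all of the norm of $v$ lies in the dominant subspace, which matches the caption's description of $\chi_{k}(\nabla L (\theta_t))$ staying near $1$.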
