
DOME: Improving Signal-to-Noise in Stochastic Gradient Descent via Sharp-Direction Subspace Filtering

Julien Nicolas, Mohamed Maouche, Sonia Ben Mokhtar, Mark Coates

TL;DR

Stochastic gradients in deep networks have persistent high-variance directions aligned with Hessian outlier eigenvectors; these directions inflate gradient norms without aiding long-horizon descent. DOME is an online method that estimates this nuisance subspace from the centered gradient covariance with a streaming power method, then projects each gradient onto the orthogonal complement of that subspace before passing it to the optimizer. The centered covariance serves as a first-order surrogate for the sharp (Hessian-outlier) directions. Removing the nuisance components does not degrade learning and can improve it, with notable gains in high-noise or heavily compressed settings. In practice, DOME raises the gradient signal-to-noise ratio, which supports more robust gradient compression and privacy-preserving training and may benefit continual learning, without modifying the underlying loss or descent directions.
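The estimate-then-project loop described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's algorithm: the class name `NuisanceSubspaceFilter`, the `block_size` parameter, the exact running mean used for centering, and the choice of a block power iteration with QR re-orthonormalization are all hypothetical choices made for the example.

```python
import numpy as np


class NuisanceSubspaceFilter:
    """Hypothetical helper: tracks the top-`rank` directions of the
    centered gradient covariance and projects gradients away from them."""

    def __init__(self, dim, rank, block_size=32, seed=0):
        rng = np.random.default_rng(seed)
        # Random orthonormal starting basis of shape (dim, rank).
        self.basis, _ = np.linalg.qr(rng.standard_normal((dim, rank)))
        self.mean = np.zeros(dim)            # running mean of the gradients
        self.count = 0
        self.block_size = block_size         # assumed: gradients per power step
        self.accum = np.zeros((dim, rank))   # accumulates (c c^T) @ basis
        self.in_block = 0

    def update(self, grad):
        # Center the gradient with an exact running mean.
        self.count += 1
        self.mean += (grad - self.mean) / self.count
        centered = grad - self.mean
        # Accumulate c (c^T U) without ever forming the dim x dim covariance.
        self.accum += np.outer(centered, centered @ self.basis)
        self.in_block += 1
        if self.in_block == self.block_size:
            # One block power step: re-orthonormalize the accumulated product.
            self.basis, _ = np.linalg.qr(self.accum)
            self.accum[:] = 0.0
            self.in_block = 0

    def filter(self, grad):
        # Project onto the orthogonal complement: g - U (U^T g).
        return grad - self.basis @ (self.basis.T @ grad)
```

In a training loop one would call `update(grad)` on each flattened stochastic gradient and hand `filter(grad)` to the optimizer. Memory stays at O(dim × rank) because the covariance matrix is never materialized.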

Abstract

Stochastic gradients for deep neural networks exhibit strong correlations along the optimization trajectory, and are often aligned with a small set of Hessian eigenvectors associated with outlier eigenvalues. Recent work shows that projecting gradients away from this Hessian outlier subspace has little impact on optimization, despite capturing a large fraction of gradient variability. Since computing the Hessian is intractable in practice, we introduce a principled first-order characterization of the nuisance subspace based on the covariance of stochastic gradients, and propose an efficient method to estimate it online. We show that removing this subspace also has little impact on optimization, and yields practical benefits for applications sensitive to gradient signal-to-noise ratio such as gradient compression.
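A short worked decomposition may help make the signal-to-noise claim concrete. The notation below ($g$, $\mu$, $\xi$, $\Sigma$, $U_k$, $P_k$) and this particular definition of SNR are assumptions introduced for illustration; the paper's own definitions may differ. Write a stochastic gradient as its mean plus zero-mean noise, and let $U_k$ collect the top-$k$ eigenvectors of the centered covariance:

$$
g = \mu + \xi, \qquad \mathbb{E}[\xi] = 0, \qquad \Sigma = \mathbb{E}[\xi \xi^\top] = \sum_{i=1}^{d} \lambda_i u_i u_i^\top, \qquad \lambda_1 \ge \dots \ge \lambda_d \ge 0,
$$

$$
U_k = [u_1, \dots, u_k], \qquad P_k = I - U_k U_k^\top .
$$

Since $P_k$ is an orthogonal projector that removes exactly the top-$k$ eigendirections of $\Sigma$,

$$
\mathbb{E}\|g\|^2 = \|\mu\|^2 + \sum_{i=1}^{d} \lambda_i, \qquad \mathbb{E}\|P_k g\|^2 = \|P_k \mu\|^2 + \sum_{i>k} \lambda_i ,
$$

$$
\mathrm{SNR}(g) = \frac{\|\mu\|^2}{\sum_{i=1}^{d} \lambda_i}, \qquad \mathrm{SNR}(P_k g) = \frac{\|P_k \mu\|^2}{\sum_{i>k} \lambda_i} .
$$

Under this reading, filtering raises the ratio whenever the leading eigenvalues $\lambda_1, \dots, \lambda_k$ account for much of the total variance while the mean gradient $\mu$ has little mass in the span of $U_k$, so that $\|P_k \mu\| \approx \|\mu\|$.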

Paper Structure

This paper contains 49 sections, 12 equations, 10 figures, 2 algorithms.

Figures (10)

  • Figure 1: Impact of filtering on training dynamics on CIFAR-10 when training a ResNet-8 with SGD (learning rate $0.1$, batch size 16) for 50 epochs. Left: Training loss as a function of epochs for the unfiltered optimizer and its filtered counterpart. Center: Average fraction of the gradient norm lying in the dominant subspace (for the filtered optimizer, measured before filtering is applied). Right: Top-1 accuracy as a function of epochs. Shaded areas indicate $99\%$ bootstrap confidence intervals over 5 random seeds.
  • Figure 2: Impact of filtering on training dynamics on TinyImageNet when training a ResNet-18 with SGD (learning rate $0.1$, batch size 64) for 50 epochs. Left: Training loss as a function of epochs for the unfiltered optimizer and its filtered counterpart. Center: Average fraction of the gradient norm lying in the dominant subspace (for the filtered optimizer, measured before filtering is applied). Right: Top-1 accuracy as a function of epochs. Shaded areas indicate $99\%$ bootstrap confidence intervals over 5 random seeds.
  • Figure 3: Compression tolerance. Test accuracy versus compression rate for Adam and DOME-filtered Adam using random Gaussian projection sketches, on MNIST with a ResNet-8. Training proceeds for 50 epochs with batch size 128. Curves report means over 5 runs; shaded regions indicate 95% bootstrap confidence intervals.
  • Figure 4: Impact of the nuisance subspace dimension. Test accuracy for Adam and DOME-filtered Adam under a fixed compression rate $d/m=10^3$ for varying subspace rank $k$, on MNIST with a ResNet-8. Curves report means over 5 runs; shaded regions indicate 95% bootstrap confidence intervals.
  • Figure 5: Centered covariance spectrum. Eigenvalue spectrum of the centered gradient covariance after training for 10 epochs on CIFAR-10 with batch size 128. The first dotted line is at rank $C=10$ and the second at rank $C^2=100$.
  • ...and 5 more figures
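
Two quantities in the captions above lend themselves to a short sketch: the fraction of a gradient lying in the dominant subspace (Figures 1 and 2, center panels) and compression by a random Gaussian projection at rate $d/m$ (Figures 3 and 4). The code below is an illustration under assumptions, not the paper's implementation: the function names are hypothetical, the fraction is taken as a ratio of squared norms (one plausible reading of the captions), and the sketch matrix uses i.i.d. $\mathcal{N}(0, 1/m)$ entries so that decompression with its transpose is unbiased.

```python
import numpy as np


def gaussian_sketch(dim, sketch_dim, seed):
    """Random m x d matrix with i.i.d. N(0, 1/m) entries (assumed scaling)."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((sketch_dim, dim)) / np.sqrt(sketch_dim)


def compress(grad, sketch):
    """Map a d-dimensional gradient to an m-dimensional code."""
    return sketch @ grad


def decompress(code, sketch):
    """Unbiased reconstruction, since E[S^T S] = I for this scaling."""
    return sketch.T @ code


def subspace_fraction(grad, basis):
    """Share of the squared gradient norm captured by an orthonormal basis."""
    return float(np.sum((basis.T @ grad) ** 2) / np.sum(grad ** 2))
```

For this Gaussian construction the expected squared reconstruction error is proportional to $\|g\|^2$ and grows roughly like $d/m$, so at a fixed compression rate any component that inflates the gradient norm also inflates the reconstruction error. This is one way to read why projecting out a high-variance subspace before sketching could help at aggressive compression rates.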