DOME: Improving Signal-to-Noise in Stochastic Gradient Descent via Sharp-Direction Subspace Filtering
Julien Nicolas, Mohamed Maouche, Sonia Ben Mokhtar, Mark Coates
TL;DR
Stochastic gradients in deep networks exhibit persistent high-variance directions aligned with Hessian outlier eigenvectors, which can inflate gradient norms without aiding long-horizon descent. This work introduces DOME, an online method that estimates a nuisance subspace from the centered gradient covariance with a streaming power method, then filters each gradient by projecting this subspace away before the optimizer step. The approach provides a first-order surrogate for sharp directions, shows that removing nuisance components does not degrade learning and can even improve it, and reports notable gains in high-noise or heavily compressed settings. In practice, DOME improves gradient signal-to-noise, which enables more robust compression and privacy-preserving training and may benefit continual learning, without modifying the underlying loss or descent directions.
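The estimate-then-filter loop summarized above can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the class name `NuisanceFilter`, the Oja-style rank-one power step, the exponential running mean used for centering, and the parameters `k`, `step`, and `beta` are all hypothetical choices that this summary does not specify.

```python
import numpy as np


class NuisanceFilter:
    """Hypothetical helper: tracks k high-variance gradient directions online."""

    def __init__(self, dim, k, step=0.01, beta=0.99, seed=0):
        rng = np.random.default_rng(seed)
        # Random orthonormal starting basis, shape (dim, k).
        self.Q, _ = np.linalg.qr(rng.standard_normal((dim, k)))
        self.mean = np.zeros(dim)  # running mean used to center gradients
        self.step = step           # assumed step size of the power update
        self.beta = beta           # assumed decay of the running mean

    def update(self, g):
        """Fold one stochastic gradient into the subspace estimate."""
        self.mean = self.beta * self.mean + (1.0 - self.beta) * g
        c = g - self.mean
        # Rank-one power step: Q + step * (c c^T) Q, never forming c c^T.
        # The QR factorization re-orthonormalizes the basis afterwards.
        self.Q, _ = np.linalg.qr(self.Q + self.step * np.outer(c, c @ self.Q))

    def filter(self, g):
        """Return g with its component in span(Q) removed."""
        return g - self.Q @ (self.Q.T @ g)
```

In a training loop, one would call `update(g)` on each flattened stochastic gradient and hand `filter(g)` to the optimizer. The sketch stores only a `dim`-by-`k` basis, so it never materializes the full covariance matrix; a suitable `step` depends on the gradient scale and would need tuning.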
Abstract
Stochastic gradients for deep neural networks exhibit strong correlations along the optimization trajectory, and are often aligned with a small set of Hessian eigenvectors associated with outlier eigenvalues. Recent work shows that projecting gradients away from this Hessian outlier subspace has little impact on optimization, even though the subspace captures a large fraction of gradient variability. Since computing the Hessian is intractable in practice, we introduce a principled first-order characterization of the nuisance subspace based on the covariance of stochastic gradients, and propose an efficient method to estimate it online. We show that removing this subspace also has little impact on optimization, and yields practical benefits for applications sensitive to gradient signal-to-noise ratio, such as gradient compression.
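One way to make the covariance-based characterization concrete is written out below. The notation is an assumption introduced here for illustration and may differ from the paper's: $g$ is a stochastic gradient in $\mathbb{R}^d$, $k$ is the assumed dimension of the nuisance subspace, and $U_k$ collects the top-$k$ eigenvectors of the centered covariance.

```latex
\begin{align}
  \bar g &= \mathbb{E}[g], \qquad
  \Sigma = \mathbb{E}\big[(g - \bar g)(g - \bar g)^\top\big], \\
  \Sigma\, u_i &= \lambda_i u_i, \qquad
  \lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d \ge 0, \\
  U_k &= [\, u_1, \dots, u_k \,], \qquad
  P_k = I - U_k U_k^\top, \qquad
  \tilde g = P_k\, g .
\end{align}
```

Because $P_k$ is an orthogonal projector, $\|\tilde g\| \le \|g\|$, and the filtered gradient has covariance $P_k \Sigma P_k$ with trace $\sum_{i>k} \lambda_i$. Under these definitions, filtering removes exactly the variance $\sum_{i \le k} \lambda_i$ carried by the top-$k$ directions.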
