A Sketch-and-Project Analysis of Subsampled Natural Gradient Algorithms
Gil Goldshlager, Jiang Hu, Lin Lin
TL;DR
This work reframes subsampled natural gradient optimization through a sketch-and-project lens, showing that squared-volume sampling (SVS) yields a faithful proxy in small-sample regimes and eliminates the need for decoupling the gradient and preconditioner. It establishes global convergence of SVS-SNG with a single mini-batch and provides an explicit LLQ convergence rate governed by sketch-and-project parameters $\boldsymbol{\alpha}$ and $\boldsymbol{\gamma}$, revealing that SNG can exploit spectral decay more effectively than SGD. The authors connect SPRING to accelerated sketch-and-project methods, deriving a bound with rate $\boldsymbol{\sqrt{\alpha/\beta}}$ and providing empirical support for acceleration in small-sample settings. Collectively, the paper advocates prioritizing sketch-and-project properties over gradient-variance proxies when analyzing and designing subsampled natural gradient algorithms for high-precision scientific machine learning tasks, with extensions to Gauss-Newton and implications for practical sampling strategies.
Abstract
Subsampled natural gradient descent (SNG) has been used to enable high-precision scientific machine learning, but standard analyses based on stochastic preconditioning fail to provide insight into realistic small-sample settings. We overcome this limitation by instead analyzing SNG as a sketch-and-project method. Motivated by this lens, we discard the usual theoretical proxy which decouples gradients and preconditioners using two independent mini-batches, and we replace it with a new proxy based on squared volume sampling. Under this new proxy we show that the expectation of the SNG direction becomes equal to a preconditioned gradient descent step even in the presence of coupling, leading to (i) global convergence guarantees when using a single mini-batch of any size, and (ii) an explicit characterization of the convergence rate in terms of quantities related to the sketch-and-project structure. These findings in turn yield new insights into small-sample settings, for example by suggesting that the advantage of SNG over SGD is that it can more effectively exploit spectral decay in the model Jacobian. We also extend these ideas to explain a popular structured momentum scheme for SNG, known as SPRING, by showing that it arises naturally from accelerated sketch-and-project methods.
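To make the sketch-and-project view concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a linear least-squares model with residual J θ − y, a unit step size, and zero damping by default, in which case one SNG step in its kernel form is the projection of θ onto the solutions of the sampled equations (a block Kaczmarz step). Uniform mini-batch sampling stands in for squared volume sampling, which is not implemented here. The names `sng_step` and `run_demo`, along with all problem sizes, are hypothetical.

```python
import numpy as np


def sng_step(theta, J, y, batch, damping=0.0):
    """One subsampled natural gradient step for the residual J @ theta - y.

    Assumption: kernel (dual) form with unit step size. With damping=0 this
    projects theta onto {t : J[batch] @ t = y[batch]}.
    """
    J_S = J[batch]                      # Jacobian rows in the mini-batch
    r_S = J_S @ theta - y[batch]        # mini-batch residual
    K = J_S @ J_S.T + damping * np.eye(len(batch))
    # Solve K z = r_S via least squares so a singular K does not raise.
    z = np.linalg.lstsq(K, r_S, rcond=None)[0]
    return theta - J_S.T @ z


def run_demo(n=200, d=50, k=10, steps=300, seed=0):
    """Run SNG on a consistent random linear system and track the error."""
    rng = np.random.default_rng(seed)
    J = rng.standard_normal((n, d))
    theta_star = rng.standard_normal(d)
    y = J @ theta_star                  # consistent system: zero residual at theta_star
    theta = np.zeros(d)
    errors = [np.linalg.norm(theta - theta_star)]
    for _ in range(steps):
        # Uniform sampling without replacement (a stand-in for SVS).
        batch = rng.choice(n, size=k, replace=False)
        theta = sng_step(theta, J, y, batch)
        errors.append(np.linalg.norm(theta - theta_star))
    return np.array(errors)


if __name__ == "__main__":
    errs = run_demo()
    print(f"initial error {errs[0]:.3e}, final error {errs[-1]:.3e}")
```

Because each undamped step is an orthogonal projection onto a set containing the true solution, the error in this toy setting never increases, even with a single small mini-batch used for both the gradient and the preconditioner.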
