A Sketch-and-Project Analysis of Subsampled Natural Gradient Algorithms

Gil Goldshlager, Jiang Hu, Lin Lin

TL;DR

This work reframes subsampled natural gradient optimization through a sketch-and-project lens, showing that squared-volume sampling (SVS) yields a faithful proxy in small-sample regimes and eliminates the need for decoupling the gradient and preconditioner. It establishes global convergence of SVS-SNG with a single mini-batch and provides an explicit LLQ convergence rate governed by sketch-and-project parameters $\alpha$ and $\gamma$, revealing that SNG can exploit spectral decay more effectively than SGD. The authors connect SPRING to accelerated sketch-and-project methods, deriving a bound with rate $\sqrt{\alpha/\beta}$ and providing empirical support for acceleration in small-sample settings. Collectively, the paper advocates prioritizing sketch-and-project properties over gradient-variance proxies when analyzing and designing subsampled natural gradient algorithms for high-precision scientific machine learning tasks, with extensions to Gauss-Newton and implications for practical sampling strategies.

Abstract

Subsampled natural gradient descent (SNG) has been used to enable high-precision scientific machine learning, but standard analyses based on stochastic preconditioning fail to provide insight into realistic small-sample settings. We overcome this limitation by instead analyzing SNG as a sketch-and-project method. Motivated by this lens, we discard the usual theoretical proxy which decouples gradients and preconditioners using two independent mini-batches, and we replace it with a new proxy based on squared volume sampling. Under this new proxy we show that the expectation of the SNG direction becomes equal to a preconditioned gradient descent step even in the presence of coupling, leading to (i) global convergence guarantees when using a single mini-batch of any size, and (ii) an explicit characterization of the convergence rate in terms of quantities related to the sketch-and-project structure. These findings in turn yield new insights into small-sample settings, for example by suggesting that the advantage of SNG over SGD is that it can more effectively exploit spectral decay in the model Jacobian. We also extend these ideas to explain a popular structured momentum scheme for SNG, known as SPRING, by showing that it arises naturally from accelerated sketch-and-project methods.
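
To make the sketch-and-project reading concrete, the following is a minimal sketch for a consistent linear least-squares problem. It assumes the SNG update takes the kernel form $\theta \leftarrow \theta - \eta\, J_S^\top (J_S J_S^\top + \lambda I)^{-1} r_S$, where $J_S$ and $r_S$ are the sampled rows of the Jacobian and residual; the function name `sng_step` and the problem sizes are illustrative and not taken from the paper. With $\eta = 1$ and $\lambda = 0$, the step projects the iterate onto the solution set of the sampled equations, which is the defining property of a sketch-and-project method.

```python
import numpy as np

def sng_step(theta, J, b, batch, eta=1.0, lam=0.0):
    """One subsampled natural gradient step for 0.5 * ||J @ theta - b||^2.

    Assumed update (kernel form):
        theta - eta * J_S^T (J_S J_S^T + lam * I)^{-1} r_S
    where J_S, r_S are the sampled rows of the Jacobian and residual.
    """
    J_S = J[batch]
    r_S = J_S @ theta - b[batch]
    K = J_S @ J_S.T + lam * np.eye(len(batch))
    return theta - eta * J_S.T @ np.linalg.solve(K, r_S)

# Illustrative sizes (hypothetical, chosen only to keep the example small).
rng = np.random.default_rng(0)
m, n, k = 50, 10, 4
J = rng.standard_normal((m, n))
theta_star = rng.standard_normal(n)
b = J @ theta_star                      # consistent system: J @ theta_star = b

theta = np.zeros(n)
batch = rng.choice(m, size=k, replace=False)
theta_new = sng_step(theta, J, b, batch)  # eta = 1, lam = 0

# After the step, the sampled equations are solved exactly ...
sampled_residual = np.linalg.norm(J[batch] @ theta_new - b[batch])
# ... and, because the step is an orthogonal projection, the distance
# to the true solution cannot increase.
err_before = np.linalg.norm(theta - theta_star)
err_after = np.linalg.norm(theta_new - theta_star)
```

In this setting a single mini-batch supplies both the gradient and the preconditioner, which is the coupling that the two-mini-batch proxy removes and that the abstract argues should be kept.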

Paper Structure

This paper contains 21 sections, 14 theorems, 96 equations, 6 figures.

Key Result

Lemma 4.1

Let $J \in \mathbb{R}^{m \times n}$, $r \in \mathbb{R}^m$, and $S \sim \mathrm{SVS}(J,k,\lambda)$. Then it holds with $\tilde{W} = (J^\top J)^{-1/2}\, \overline{P}\, (J^\top J)^{-1/2}$ and $\overline{P}$ as in (eq:alpha_p).
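
The structure of this lemma can be checked numerically on a tiny problem by enumerating every mini-batch. The sketch below rests on two assumptions that the excerpt does not confirm: that SVS($J,k,\lambda$) draws a size-$k$ row subset $S$ with probability proportional to $\det(J_S J_S^\top + \lambda I)$, and that $\overline{P}$ is the expected regularized projection $\mathbb{E}[J_S^\top (J_S J_S^\top + \lambda I)^{-1} J_S]$. The helper names are hypothetical. Under these assumptions, the expected SNG direction $J^\top M r$ should equal $W J^\top r$ for a symmetric positive definite $W$ that agrees with $\tilde{W}$.

```python
import itertools
import numpy as np

def svs_distribution(J, k, lam):
    """All size-k row subsets with probabilities proportional to
    det(J_S J_S^T + lam * I). ASSUMED definition of SVS(J, k, lam);
    brute-force enumeration, feasible only for tiny m."""
    subsets = [list(S) for S in itertools.combinations(range(J.shape[0]), k)]
    w = np.array([np.linalg.det(J[S] @ J[S].T + lam * np.eye(k)) for S in subsets])
    return subsets, w / w.sum()

def expected_kernel_inverse(J, k, lam):
    """M = E[ I_S^T (J_S J_S^T + lam * I)^{-1} I_S ] (an m x m matrix),
    so that the expected SNG direction is J^T @ M @ r."""
    m = J.shape[0]
    M = np.zeros((m, m))
    subsets, probs = svs_distribution(J, k, lam)
    for S, p in zip(subsets, probs):
        K_inv = np.linalg.inv(J[S] @ J[S].T + lam * np.eye(k))
        M[np.ix_(S, S)] += p * K_inv
    return M

rng = np.random.default_rng(0)
m, n, k, lam = 7, 4, 2, 0.1            # illustrative sizes
J = rng.standard_normal((m, n))
M = expected_kernel_inverse(J, k, lam)

# M should commute with J J^T, so J^T M = W J^T for an n x n matrix W.
A = J @ J.T
commutator = np.linalg.norm(M @ A - A @ M)
G = J.T @ J
W = J.T @ M @ J @ np.linalg.inv(G)

# Compare with (J^T J)^{-1/2} P_bar (J^T J)^{-1/2}, reading P_bar as the
# expected regularized projection (an assumption, see the lead-in).
P_bar = J.T @ M @ J
evals, evecs = np.linalg.eigh(G)
G_inv_sqrt = (evecs / np.sqrt(evals)) @ evecs.T
W_tilde = G_inv_sqrt @ P_bar @ G_inv_sqrt
```

The commutation holds because the determinant weights cancel the inverse into an adjugate, and the sum of adjugates of principal submatrices is the gradient of an elementary symmetric polynomial of $J J^\top + \lambda I$, which shares its eigenvectors.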

Figures (6)

  • Figure 1: Summary of technical contributions relative to previous analyses based on stochastic preconditioning.
  • Figure 2: Empirical performance of two theoretical proxies for SNG relative to a realistic algorithm using a single, uniformly sampled mini-batch. The results are for a consistent instance of linear least-quadratics involving random Gaussian matrices and the behavior is tested for a fixed step size $\eta$ and varying regularizations $\lambda$ (left), as well as for fixed $\lambda$ and varying $\eta$ (right). In both cases the behavior of SVS-SNG matches closely with the realistic algorithm, while the behavior of the two-mini-batch proxy is entirely distinct. The dimensions are $m=10^3$, $n=10^2$, and $k=10$, which is small enough to directly implement SVS. The $y$-axis represents the relative error achieved after $T=10^3$ iterations. See also the appendix for an ablation regarding non-Gaussian data.
  • Figure 3: Empirical scaling of the convergence parameters $\alpha$ and $\gamma$ as a function of the sample size $k$, under SVS. The results are for a random Gaussian Jacobian matrix with quadratic spectral decay. The sketch-and-project constant $\alpha$ grows superlinearly as predicted, and the presence of $\gamma$ only mildly impedes this superlinear scaling. The regularization is set to $\lambda=0$ for simplicity and the dimensions are $m=10^3$ and $n=10^2$, which is small enough to (i) calculate $\alpha$ directly using symmetric polynomials, and (ii) estimate $\gamma$ by sampling directly from SVS($J,k,\lambda$).
  • Figure 4: Empirical validation that SNG can exploit spectral decay in the model Jacobian, under uniform sampling. The results are for a consistent instance of linear least-quadratics involving random Gaussian matrices, with $J$ exhibiting quadratic spectral decay. The empirical rate constant grows superlinearly as predicted even while the step size is fixed at $\eta=1$. The regularization is set to $\lambda=0$ for simplicity and the dimensions are $m=10^3$ and $n=10^2$. Inset: when $J$ has a flat spectrum, the rate scales linearly.
  • Figure 5: Empirical comparison of the convergence rates of SNG and SPRING for a variety of sample sizes. The results are for a consistent instance of linear least-quadratics in which $J$ and $H$ both have linearly decaying spectra. As suggested by Conjecture 6.2, SPRING provides substantial acceleration for small sample sizes, but the gap shrinks as the sample size is increased. The regularization is set to $\lambda=0$ for simplicity and the dimensions are $m=10^3$, $n=10^2$. Furthermore, the hyperparameters $\eta$ and $\mu$ are tuned independently for each run.
  • ...and 1 more figure
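
A setup in the spirit of Figures 3 and 4 can be sketched as follows. The construction is an assumption: the captions do not say how the spectral decay is imposed, so this sketch keeps the singular vectors of a Gaussian matrix and sets the singular values to $\sigma_i = i^{-2}$ as one reading of "quadratic spectral decay". The function names and the reduced dimensions are illustrative, and plain least squares is used in place of the more general linear least-quadratics problem.

```python
import numpy as np

def gaussian_with_decay(m, n, power, rng):
    """Random m x n matrix with singular values i**(-power), i = 1..n.
    ASSUMED construction: keep the singular vectors of a Gaussian matrix
    and overwrite its singular values."""
    U, _, Vt = np.linalg.svd(rng.standard_normal((m, n)), full_matrices=False)
    sigma = np.arange(1, n + 1, dtype=float) ** (-power)
    return (U * sigma) @ Vt, sigma

def relative_error_after(J, k, T, rng):
    """Run T SNG steps with eta = 1, lam = 0 and uniformly sampled
    mini-batches of size k on a consistent system; return the relative
    error in the parameters."""
    m, n = J.shape
    theta_star = rng.standard_normal(n)
    b = J @ theta_star
    theta = np.zeros(n)
    for _ in range(T):
        S = rng.choice(m, size=k, replace=False)
        r_S = J[S] @ theta - b[S]
        theta = theta - J[S].T @ np.linalg.solve(J[S] @ J[S].T, r_S)
    return np.linalg.norm(theta - theta_star) / np.linalg.norm(theta_star)

rng = np.random.default_rng(0)
J, sigma = gaussian_with_decay(m=200, n=20, power=2.0, rng=rng)

# Relative error after a fixed iteration budget for several batch sizes.
errors = {k: relative_error_after(J, k, T=200, rng=rng) for k in (2, 5, 10)}
```

Replacing `power=2.0` with `power=0.0` gives a flat spectrum, which is the comparison shown in the inset of Figure 4.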

Theorems & Definitions (29)

  • Lemma 4.1
  • Theorem 4.2: Global convergence of SVS-SNG
  • Theorem 5.1: Convergence of SVS-SNG for LLQ
  • Corollary 5.2: Convergence of NGD for LLQ
  • Proposition 5.3
  • Proposition 5.4: Convergence under strong compatibility
  • Theorem 6.1: SPRING as accelerated sketch-and-project
  • Conjecture 6.2: Convergence of SVS-SPRING for LLQ
  • Lemma 1.1: Key expectation formula under SVS
  • proof
  • ...and 19 more