Optimal sampling for stochastic and natural gradient descent
Robert Gruhlke, Anthony Nouy, Philipp Trunschke
TL;DR
This work develops a stochastic optimisation framework on nonlinear model classes that uses optimal sampling to bound the variance of gradient estimators and handles the nonlinearity through local linearisations, namely tangent-space projections followed by retractions. By analysing unbiased, quasi-projection, and least-squares (LS) projection estimators within a natural-gradient-like descent, it proves almost-sure convergence to stationary points of the true objective and, under Polyak-Łojasiewicz (PL)-type conditions, bounds on the generalisation error, while achieving convergence rates comparable to deterministic first-order methods under suitable conditions. The approach leverages optimal sampling (including volume sampling and determinantal point processes, DPPs) to control the variance, with explicit rates provided for both unbiased and biased projection settings across linear and shallow neural network models. Overall, the methodology offers a principled path to near-deterministic convergence in stochastic settings by coupling geometry-aware projections with careful step-size strategies, and it demonstrates practical gains in experiments with linear models and neural networks.
Abstract
We consider the problem of optimising the expected value of a loss functional over a nonlinear model class of functions, assuming that we only have access to realisations of the gradient of the loss. This is a classical task in statistics, machine learning and physics-informed machine learning. A straightforward solution is to replace the exact objective with a Monte Carlo estimate before employing standard first-order methods like gradient descent, which yields the classical stochastic gradient descent method. But replacing the true objective with an estimate incurs a generalisation error. Rigorous bounds for this error typically require strong compactness and Lipschitz continuity assumptions while providing a very slow decay with sample size. To alleviate these issues, we propose a version of natural gradient descent that is based on optimal sampling methods. Under classical assumptions on the loss and the nonlinear model class, we prove that this scheme converges almost surely monotonically to a stationary point of the true objective. Under Polyak-Łojasiewicz-type conditions, this provides bounds for the generalisation error. As a remarkable result, we show that our stochastic optimisation scheme achieves the linear or exponential convergence rates of deterministic first-order descent methods under suitable conditions.
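To make the idea of projecting a gradient with optimally drawn samples concrete, the following is a minimal sketch, not the authors' implementation. It assumes a fixed d-dimensional linear space (standing in for a local linearisation of the model class) spanned by Legendre polynomials on [-1, 1] with the uniform probability measure, and it uses the standard inverse-Christoffel-function sampling density for weighted least squares. All function names, the choice of basis, and the one-dimensional domain are illustrative assumptions.

```python
import numpy as np
from numpy.polynomial.legendre import legvander


def basis(x, d):
    """First d Legendre polynomials, orthonormal w.r.t. the uniform
    probability measure rho on [-1, 1] (an assumed stand-in for a basis
    of the linearised model class)."""
    x = np.atleast_1d(x)
    return legvander(x, d - 1) * np.sqrt(2 * np.arange(d) + 1)


def optimal_weights(x, d):
    """Weights w = 1/k, where k(x) = (1/d) * sum_j b_j(x)^2 is the
    optimal sampling density with respect to rho."""
    return d / np.sum(basis(x, d) ** 2, axis=1)


def sample_optimal(n, d, rng):
    """Draw n points from k * rho, viewed as a uniform mixture of the
    densities b_j^2 * rho, each sampled by rejection from rho."""
    samples = np.empty(n)
    for i in range(n):
        j = rng.integers(d)  # choose a mixture component uniformly
        while True:
            x = rng.uniform(-1.0, 1.0)  # proposal drawn from rho
            # b_j(x)^2 <= 2j + 1 on [-1, 1], so this ratio is at most 1
            if rng.uniform() * (2 * j + 1) <= basis(x, d)[0, j] ** 2:
                samples[i] = x
                break
    return samples


def quasi_projection(g, x, d):
    """Unbiased Monte Carlo estimate of the coefficients <b_j, g>_rho."""
    B = basis(x, d)
    w = optimal_weights(x, d)
    return np.mean((w * g(x))[:, None] * B, axis=0)


def weighted_ls_projection(g, x, d):
    """Weighted least-squares projection of g onto the span of the basis."""
    B = basis(x, d)
    sw = np.sqrt(optimal_weights(x, d))
    coeffs, *_ = np.linalg.lstsq(sw[:, None] * B, sw * g(x), rcond=None)
    return coeffs


# Usage: project a stand-in "gradient" g(x) = exp(x) using 50 optimal samples.
rng = np.random.default_rng(0)
x = sample_optimal(50, 5, rng)
coeffs_qp = quasi_projection(np.exp, x, 5)
coeffs_ls = weighted_ls_projection(np.exp, x, 5)
```

In this sketch the weights are bounded by d, which keeps the variance of the quasi-projection estimate under control. The least-squares variant reproduces any function already in the span exactly, at the price of being biased in general.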
