Optimal sampling for stochastic and natural gradient descent

Robert Gruhlke; Anthony Nouy; Philipp Trunschke

TL;DR

This work develops a stochastic optimisation framework for nonlinear model classes that uses optimal sampling to bound the variance of gradient estimators and handles the nonlinearity through local linearisations, using tangent-space projections and retractions. By analysing unbiased, quasi- and least-squares-based projection estimators within a natural-gradient-like descent, it proves almost sure convergence to stationary points of the true objective and, under Polyak-Łojasiewicz-type (PL-type) conditions, bounds on the generalisation error, while achieving convergence rates comparable to deterministic first-order methods under suitable conditions. The approach uses optimal sampling (including volume sampling and determinantal point processes) to control the variance, with explicit rates for both unbiased and biased projection settings, for linear models and shallow neural networks. Overall, the method couples geometry-aware projections with careful step-size strategies to approach deterministic convergence behaviour in a stochastic setting, and it demonstrates practical gains in experiments with linear models and shallow neural networks.
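To make the role of optimal sampling concrete, the following is a minimal sketch of a quasi-projection and a weighted least-squares projection onto a linear space. It is not taken from the paper: it assumes a polynomial space on a discretised interval, and it assumes the common form of optimal sampling for least squares in which points are drawn with probability proportional to the sum of squared orthonormal basis functions and reweighted accordingly. All function names are hypothetical.

```python
import numpy as np

def orthonormal_basis(x, d):
    """Polynomials of degree < d, orthonormal w.r.t. the uniform measure on the grid x."""
    V = np.polynomial.legendre.legvander(x, d - 1)  # shape (N, d)
    Q, _ = np.linalg.qr(V)                          # Q has orthonormal columns
    return Q * np.sqrt(len(x))                      # now B.T @ B / N is the identity

def optimal_sample(B, n, rng):
    """Draw n grid indices with probability proportional to sum_j b_j(x)^2."""
    k = np.sum(B**2, axis=1)           # its mean over the grid equals d
    idx = rng.choice(len(k), size=n, p=k / k.sum())
    w = B.shape[1] / k[idx]            # weights d / k(x_i) compensate for the sampling density
    return idx, w

def quasi_projection(B, g, idx, w):
    """Coefficients (1/n) sum_i w_i b_j(x_i) g(x_i): an unbiased estimate of <b_j, g>."""
    return (w[:, None] * B[idx] * g[idx][:, None]).mean(axis=0)

def ls_projection(B, g, idx, w):
    """Coefficients of the weighted least-squares fit of g on the sampled points."""
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(sw[:, None] * B[idx], sw * g[idx], rcond=None)
    return coef

x = np.linspace(-1.0, 1.0, 2000)
B = orthonormal_basis(x, 5)
c_true = np.array([1.0, -0.5, 0.25, 0.0, 2.0])
g = B @ c_true                          # a function that lies in the linear space
idx, w = optimal_sample(B, 50, np.random.default_rng(0))
c_ls = ls_projection(B, g, idx, w)      # recovers c_true up to rounding
c_qp = quasi_projection(B, g, idx, w)   # unbiased, but noisy for finite n
```

In this sketch the least-squares estimator reproduces functions in the space exactly but is biased in general, whereas the quasi-projection is unbiased but carries sampling noise even for functions in the space. This is the kind of trade-off between biased and unbiased projection settings that the summary above refers to.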

Abstract

We consider the problem of optimising the expected value of a loss functional over a nonlinear model class of functions, assuming that we only have access to realisations of the gradient of the loss. This is a classical task in statistics, machine learning and physics-informed machine learning. A straightforward solution is to replace the exact objective with a Monte Carlo estimate before employing standard first-order methods like gradient descent, which yields the classical stochastic gradient descent method. But replacing the true objective with an estimate incurs a generalisation error. Rigorous bounds for this error typically require strong compactness and Lipschitz continuity assumptions while providing a very slow decay with sample size. To alleviate these issues, we propose a version of natural gradient descent that is based on optimal sampling methods. Under classical assumptions on the loss and the nonlinear model class, we prove that this scheme converges almost surely monotonically to a stationary point of the true objective. Under Polyak-Łojasiewicz-type conditions, this provides bounds for the generalisation error. As a remarkable result, we show that our stochastic optimisation scheme achieves the linear or exponential convergence rates of deterministic first-order descent methods under suitable conditions.
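For orientation, the deterministic rate that the abstract compares against can be recalled in its classical Euclidean form. The derivation below is the textbook argument, not the paper's manifold version; the smoothness constant $L$, the Polyak-Łojasiewicz constant $\mu$, the step size $s = 1/L$ and the minimal value $\mathcal{L}_{\mathrm{min}}$ are assumptions made for this illustration.

$$
\begin{aligned}
\mathcal{L}(u_{t+1}) &\le \mathcal{L}(u_t) - \tfrac{1}{2L}\,\|\nabla\mathcal{L}(u_t)\|^2
&& \text{($L$-smoothness, } u_{t+1} = u_t - \tfrac{1}{L}\nabla\mathcal{L}(u_t)\text{)} \\
\tfrac{1}{2}\,\|\nabla\mathcal{L}(u_t)\|^2 &\ge \mu\,\bigl(\mathcal{L}(u_t) - \mathcal{L}_{\mathrm{min}}\bigr)
&& \text{(Polyak-Łojasiewicz condition)} \\
\mathcal{L}(u_{t+1}) - \mathcal{L}_{\mathrm{min}} &\le \bigl(1 - \tfrac{\mu}{L}\bigr)\,\bigl(\mathcal{L}(u_t) - \mathcal{L}_{\mathrm{min}}\bigr)
&& \text{(combining the two)}
\end{aligned}
$$

Iterating the last line gives the linear (geometric) decay $\mathcal{L}(u_t) - \mathcal{L}_{\mathrm{min}} \le (1 - \mu/L)^t\,(\mathcal{L}(u_0) - \mathcal{L}_{\mathrm{min}})$.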

Paper Structure

This paper contains 46 sections, 28 theorems, 191 equations, 8 figures, 2 tables.

Introduction
Setting
Contributions
Related work
Outline
Assumptions
Estimators of orthogonal projections
Non-projection
Quasi-projection
Least squares projection
Convergence theory
Convergence in expectation in the unbiased case
Almost sure convergence in the unbiased case
Almost sure convergence under assumption eq:L-smooth
Almost sure convergence under assumption eq:mu-PL_strong
...and 31 more sections

Key Result

theorem 4.1

Assume that the loss function satisfies assumptions eq:L-smooth and eq:C-retraction with a $t$-dependent perturbation $\beta_t\ge 0$ and that the projection estimate satisfies assumption eq:bbv. Then at step $t\ge0$ it holds that

Figures (8)

Figure 1: Illustration of the proposed algorithm. Starting from the iterate $u_t\in\mathcal{M}$, an approximation $P_t^n g_t \in \mathcal{T}_t$ of the true gradient $g_t$ is computed via a random projection $P_t^n$ onto the linear space $\mathcal{T}_t$. Subsequently, an intermediate, linear update $\bar{u}_{t+1} = u_t - s_t P_t^n g_t$ is performed with step size $s_t$. Finally, the next iterate $u_{t+1}\in\mathcal{M}$ is obtained through application of the retraction map $R_t$.
Figure 2: Loss error $\mathcal{L}(u_t) - \mathcal{L}_{\mathrm{min}, \mathcal{M}}$, plotted against the number of steps
Figure 3: Loss error $\mathcal{L}(u_t) - \mathcal{L}_{\mathrm{min}, \mathcal{M}}$, plotted against the number of steps
Figure 4: Loss error $\mathcal{L}(u_t) - \mathcal{L}_{\mathrm{min}, \mathcal{M}}$, plotted against the number of steps
Figure 5: Loss error $\mathcal{L}(u_t) - \mathcal{L}_{\mathrm{min}, \mathcal{M}}$, plotted against the number of steps
...and 3 more figures
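
The three steps described in the caption of Figure 1 (projection onto the tangent space, linear update, retraction) can be sketched on a toy problem. This is only an illustration under strong simplifying assumptions that are not from the paper: the model class $\mathcal{M}$ is replaced by the unit sphere in $\mathbb{R}^3$, the loss is $\tfrac12\|u - y\|^2$, the random projection $P_t^n$ is replaced by the exact tangent projection applied to a noise-perturbed gradient, and the retraction is plain normalisation. All names are hypothetical.

```python
import numpy as np

def tangent_projection(u, g):
    """Orthogonal projection of g onto the tangent space of the unit sphere at u."""
    return g - np.dot(g, u) * u

def retraction(v):
    """Map a point of the ambient space back onto the unit sphere."""
    return v / np.linalg.norm(v)

def descent(u0, y, step_size, num_steps, noise=0.0, seed=0):
    """Projected descent with retraction for the loss 0.5 * ||u - y||^2 on the sphere."""
    rng = np.random.default_rng(seed)
    u = retraction(np.asarray(u0, dtype=float))
    for _ in range(num_steps):
        g = u - y                                           # gradient g_t of the loss
        g_noisy = g + noise * rng.standard_normal(g.shape)  # stand-in for sampling error
        p = tangent_projection(u, g_noisy)                  # plays the role of P_t^n g_t
        u_bar = u - step_size * p                           # intermediate linear update
        u = retraction(u_bar)                               # next iterate u_{t+1} on the sphere
    return u

y = np.array([1.0, 2.0, 2.0])
u_exact = descent([1.0, 0.0, 0.0], y, step_size=0.2, num_steps=200)
u_noisy = descent([1.0, 0.0, 0.0], y, step_size=0.2, num_steps=200, noise=0.1)
```

Without noise the iterates approach $y/\|y\|$, the minimiser of the loss on the sphere. With noise they remain on the sphere but fluctuate around the minimiser, which is the effect that variance control via sampling and step-size choice is meant to address.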

Theorems & Definitions (66)

remark 1.1
remark 3.1: Quasi-projection yields natural gradient descent
remark 3.2
remark 3.3
remark 3.4
theorem 4.1
proof
theorem 4.2
proof
corollary 4.3
...and 56 more
