Stochastic gradient with least-squares control variates
Fabio Nobile, Matteo Raviola, Nathan Schaeffer
TL;DR
This work tackles stochastic optimization problems in which the objective is an expectation, J(u) = \mathbb{E}_{Y\sim\rho}[g(u,Y)], gradient evaluations are expensive, and the distribution ρ is known. It introduces SG-LSCV, a memory-based control-variate method that fits a linear surrogate of the gradient to past samples by optimal weighted least squares, then uses this surrogate to build a variance-reduced gradient update that preserves the per-iteration cost of SGD. The authors prove convergence guarantees for both fixed and growing approximation spaces, with exponential or algebraic decay depending on the gradient-projection error and the step-size schedule, and they demonstrate the approach on PDE-constrained optimization problems under uncertainty. The results indicate substantial improvements over SGD and over finite-sum variance-reduction methods such as SAGA, especially in high-dimensional or continuous-parameter settings, obtained by exploiting the regularity of the gradient through structured polynomial approximations and optimal sampling. The method offers a scalable framework for variance reduction in continuous stochastic optimization, with potential applications in machine-learning settings where the data-generating distribution is known and gradient smoothness can be exploited.
Abstract
The stochastic gradient descent (SGD) method is a widely used approach for solving stochastic optimization problems, but its convergence is typically slow. Existing variance reduction techniques, such as SAGA, improve convergence by leveraging stored gradient information; however, they are restricted to settings where the objective functional is a finite sum, and their performance degrades when the number of terms in the sum is large. In this work, we propose a novel approach which is well suited when the objective is given by an expectation over random variables with a continuous probability distribution. Our method constructs a control variate by fitting a linear model to past gradient evaluations using weighted discrete least-squares, effectively reducing variance while preserving computational efficiency. We establish theoretical sublinear convergence guarantees for strongly convex objectives and demonstrate the method's effectiveness through numerical experiments on random PDE-constrained optimization problems.
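To make the control-variate idea concrete, the following is a minimal sketch on a toy problem, not the authors' implementation. Everything specific in it is an assumption made for illustration: the objective g(u, y) = 0.5 (2 + y) ||u - (y, y^2)||^2 with Y uniform on [-1, 1], the Legendre basis of degree 3, the sliding window of 20 past samples, the constant step size, and all function and variable names. It also swaps in plain unweighted least squares with samples drawn directly from ρ, rather than the optimal weighted least-squares sampling described above. The fitted model is linear in the basis coefficients, its mean under ρ is known in closed form (the coefficient of the constant Legendre polynomial), and each iteration still uses one new gradient evaluation.

```python
import numpy as np
from numpy.polynomial.legendre import legvander

# Toy problem (an assumption for illustration, not from the paper):
# g(u, y) = 0.5 * (2 + y) * ||u - b(y)||^2, b(y) = (y, y^2), Y ~ Uniform(-1, 1).
# Then grad J(u) = 2 u - (1/3, 2/3), so the minimizer is (1/6, 1/3).
U_STAR = np.array([1.0 / 6.0, 1.0 / 3.0])


def grad_g(u, y):
    """Gradient of g(u, y) with respect to u for a single sample y."""
    return (2.0 + y) * (u - np.array([y, y**2]))


def run(use_cv, n_iter=2000, step=0.05, degree=3, window=20, seed=0):
    """SGD, optionally with a least-squares control variate built from
    the last `window` gradient evaluations (hypothetical parameter names)."""
    rng = np.random.default_rng(seed)
    u = np.zeros(2)
    past_y, past_grad = [], []
    for _ in range(n_iter):
        y = rng.uniform(-1.0, 1.0)
        g = grad_g(u, y)  # the only gradient evaluation in this iteration
        if use_cv and len(past_y) >= window:
            # Fit y -> gradient with a Legendre expansion, using only past
            # samples so the fitted model is independent of the new y.
            V = legvander(np.array(past_y[-window:]), degree)
            G = np.array(past_grad[-window:])
            coef, *_ = np.linalg.lstsq(V, G, rcond=None)  # shape (degree+1, 2)
            surrogate_at_y = legvander(np.array([y]), degree)[0] @ coef
            # Under Uniform(-1, 1), E[P_0] = 1 and E[P_n] = 0 for n >= 1,
            # so the surrogate's exact mean is its constant coefficient.
            surrogate_mean = coef[0]
            direction = g - surrogate_at_y + surrogate_mean
        else:
            direction = g  # plain SGD step
        u = u - step * direction
        past_y.append(y)
        past_grad.append(g)
    return u


if __name__ == "__main__":
    for use_cv in (False, True):
        err = np.linalg.norm(run(use_cv) - U_STAR)
        print(f"control variate={use_cv}: distance to minimizer = {err:.3e}")
```

In this toy problem the gradient is a cubic polynomial in y, so it lies exactly in the approximation space, and the only error in the control variate comes from fitting to gradients evaluated at slightly older iterates. That error shrinks as the iterates settle, which is why a constant step size suffices here; when the gradient is not exactly representable, a projection error remains and the behaviour would differ.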
