Stochastic Quasi-Gradient Methods: Variance Reduction via Jacobian Sketching

Robert M. Gower, Peter Richtárik, Francis Bach

TL;DR

The paper introduces JacSketch, a variance-reduced stochastic gradient method built on Jacobian sketching. By maintaining and updating a Jacobian estimate with sketch-and-project steps, JacSketch yields unbiased gradient estimators and provable linear convergence for smooth strongly convex objectives, unifying SAGA/minibatch variants under a single framework. It provides general and specialized convergence theorems, sharp results for SAGA with importance sampling, and a refined Lyapunov-based analysis that clarifies minibatch choices. The approach bridges stochastic optimization with randomized linear algebra, offering reduced-memory variants and practical guidance on minibatch sizing and nonuniform sampling. Empirical results corroborate theoretical gains, and the work outlines avenues like structured weight matrices and Johnson-Lindenstrauss sketches for further improvements.

Abstract

We develop a new family of variance-reduced stochastic gradient descent methods for minimizing the average of a very large number of smooth functions. Our method, JacSketch, is motivated by novel developments in randomized numerical linear algebra, and operates by maintaining a stochastic estimate of a Jacobian matrix composed of the gradients of individual functions. In each iteration, JacSketch efficiently updates the Jacobian matrix by first obtaining a random linear measurement of the true Jacobian through (cheap) sketching, and then projecting the previous estimate onto the solution space of a linear matrix equation whose solutions are consistent with the measurement. The Jacobian estimate is then used to compute a variance-reduced unbiased estimator of the gradient. Our strategy is analogous to the way quasi-Newton methods maintain an estimate of the Hessian, and hence our method can be seen as a stochastic quasi-gradient method. We prove that for smooth and strongly convex functions, JacSketch converges linearly with a meaningful rate dictated by a single convergence theorem which applies to general sketches. We also provide a refined convergence theorem which applies to a smaller class of sketches. This enables us to obtain sharper complexity results for variants of JacSketch with importance sampling. By specializing our general approach to specific sketching strategies, JacSketch reduces to the stochastic average gradient (SAGA) method, and several of its existing and many new minibatch, reduced-memory, and importance sampling variants. Our rate for SAGA with importance sampling is the current best-known rate for this method, resolving a conjecture by Schmidt et al. (2015). The rates we obtain for minibatch SAGA are also superior to existing rates.
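The loop described in the abstract (keep a Jacobian estimate, refresh part of it from a cheap measurement, then form a variance-reduced gradient estimate) can be illustrated with the simplest special case the abstract names, SAGA with uniform sampling. The sketch below is a minimal illustration, not the authors' implementation: the ridge regression objective, the function name `saga_ridge`, the step size `1 / (3 * L_max)`, and the synthetic data are all assumptions chosen for the example. Sampling one index `i` and overwriting column `i` of the estimate `J` with the fresh gradient plays the role of the sketch-and-project step.

```python
import numpy as np

def saga_ridge(A, b, lam, step, iters, seed=0):
    """SAGA on f(x) = (1/n) * sum_i [0.5*(a_i @ x - b_i)**2 + 0.5*lam*||x||^2].

    Hypothetical helper for illustration only.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x = np.zeros(d)
    J = np.zeros((d, n))        # Jacobian estimate: column i approximates grad f_i
    J_mean = J.mean(axis=1)     # (1/n) * J @ ones, maintained incrementally
    for _ in range(iters):
        i = rng.integers(n)     # uniform sampling of one index
        grad_i = A[i] * (A[i] @ x - b[i]) + lam * x
        # Variance-reduced estimate, built from the OLD Jacobian estimate.
        g = J_mean + (grad_i - J[:, i])
        # Refresh column i of the estimate and its running mean.
        J_mean += (grad_i - J[:, i]) / n
        J[:, i] = grad_i
        x -= step * g
    return x

# Assumed synthetic problem, small enough to compare against the closed form.
rng = np.random.default_rng(1)
n, d, lam = 50, 5, 0.1
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
L_max = np.max(np.sum(A**2, axis=1)) + lam   # largest smoothness constant of the f_i
x_saga = saga_ridge(A, b, lam, step=1.0 / (3.0 * L_max), iters=20000)
x_star = np.linalg.solve(A.T @ A / n + lam * np.eye(d), A.T @ b / n)
print(np.linalg.norm(x_saga - x_star))
```

Storing `J` as a dense `d x n` array is the memory cost that the reduced-memory variants mentioned in the abstract aim to lower; the sketch keeps the dense form only for readability.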

Paper Structure

This paper contains 54 sections, 20 theorems, 174 equations, 3 figures, 2 tables, 3 algorithms.

Key Result

lemma 5

If $\mathbf{S}$ is an unbiased sketch (see Definition def:sketch), then for every ${\bf J} \in \mathbb{R}^{d \times n}$ and $x\in \mathbb{R}^d$, the estimate eq:controlgradJ is an unbiased estimate of the gradient of the objective in eq:prob.
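The unbiasedness claim can be checked numerically in the single-index case. The sketch below assumes the SAGA-style estimator with sampling probabilities $p_i$, namely $g_i = \frac{1}{n}{\bf J}e + \frac{1}{n p_i}(\nabla f_i(x) - {\bf J}_{:i})$; this form, the helper name `expected_estimate`, and the random test matrices are assumptions for illustration and are not taken from eq:controlgradJ itself. Averaging $g_i$ over $i$ with weights $p_i$ should return the full gradient for any choice of ${\bf J}$.

```python
import numpy as np

def expected_estimate(G, J, p):
    """Exact expectation over i ~ p of the assumed single-index estimator.

    G : (d, n) array whose column i is grad f_i(x)
    J : (d, n) arbitrary Jacobian estimate
    p : (n,) sampling probabilities, positive and summing to 1
    """
    n = G.shape[1]
    J_mean = J.mean(axis=1)
    # Column i holds the estimate produced when index i is sampled.
    estimates = J_mean[:, None] + (G - J) / (n * p)
    # Weight each column by its probability to get the expectation.
    return estimates @ p

rng = np.random.default_rng(0)
d, n = 4, 7
G = rng.standard_normal((d, n))   # stand-in for the true per-function gradients
J = rng.standard_normal((d, n))   # arbitrary, possibly poor, Jacobian estimate
p = rng.dirichlet(np.ones(n))     # arbitrary non-uniform sampling probabilities
full_gradient = G.mean(axis=1)
print(np.allclose(expected_estimate(G, J, p), full_gradient))
```

The check passes regardless of how far `J` is from `G`: the quality of the Jacobian estimate affects the variance of the estimate, not its mean.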

Figures (3)

  • Figure 1: Comparing the performance of SAGA with importance sampling based on the optimized probabilities (eq:optprobs) (SAGA-opt), $p_i = L_i/\overline{L}$ (SAGA-Li) and $p_i = 1/n$ (SAGA-uni) for an artificially constructed ridge regression problem as $n$ grows.
  • Figure 2: The iteration complexity of minibatch SAGA (eq:SAGAmini-nice) vs the minibatch size $\tau$ for two ridge regression problems (eq:ridge). We used $\lambda = L_{\max}/n$.
  • Figure 3: Comparison of the methods on logistic regression problems (eq:logistic) with data taken from LIBSVM (Chang2011).

Theorems & Definitions (54)

  • example 2: gradient descent
  • example 3: SGD with non-uniform sampling
  • example 4: minibatch SGD
  • lemma 5
  • example 6: Zero sketch residual
  • example 7: Large sketch residual
  • remark 8: On the weight matrix and the cost
  • lemma 11
  • proof
  • ...and 44 more