One Method to Rule Them All: Variance Reduction for Data, Parameters and Many New Methods

Filip Hanzely, Peter Richtárik

TL;DR

The paper introduces Generalized Jacobian Sketching (GJS), a unifying variance-reduction framework for regularized ERM problems with many data points, a high model dimension, or both. GJS accesses gradient information through random sketches and maintains a Jacobian proxy that converges to the true Jacobian at the optimum. A single theorem establishes linear convergence under $\sigma$-quasi-strong convexity and $\{ {\bf M}_j\}$-smoothness. The framework unifies stochastic gradient and stochastic coordinate descent type methods: existing algorithms (SAGA, SEGA, JacSketch, LSVRG, ISEGA), together with their arbitrary-sampling and proximal generalizations, arise as special cases, as do many new variants. Experiments on LibSVM datasets illustrate practical performance gains and parallel scalability.

Abstract

We propose a remarkably general variance-reduced method suitable for solving regularized empirical risk minimization problems with either a large number of training examples, or a large model dimension, or both. In special cases, our method reduces to several known and previously thought to be unrelated methods, such as {\tt SAGA}, {\tt LSVRG}, {\tt JacSketch}, {\tt SEGA} and {\tt ISEGA}, and their arbitrary sampling and proximal generalizations. However, we also highlight a large number of new specific algorithms with interesting properties. We provide a single theorem establishing linear convergence of the method under smoothness and quasi strong convexity assumptions. With this theorem we recover best-known and sometimes improved rates for known methods arising in special cases. As a by-product, we provide the first unified method and theory for stochastic gradient and stochastic coordinate descent type methods.
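
To make the idea of a stored Jacobian estimate concrete, here is a minimal sketch of plain SAGA, one of the special cases named above, on a toy least-squares problem. It is not the general method of the paper: the function name `saga_least_squares`, the toy data, uniform sampling, the absence of a regularizer, and the step size $1/(3 L_{\max})$ are all assumptions made for illustration.

```python
import random


def saga_least_squares(A, b, step, iters, seed=0):
    """Plain SAGA for f(x) = (1/n) * sum_j 0.5 * (a_j . x - b_j)^2 (illustrative only)."""
    rng = random.Random(seed)
    n, d = len(A), len(A[0])
    x = [0.0] * d

    def grad_j(j, point):
        # Gradient of the j-th loss: (a_j . x - b_j) * a_j.
        r = sum(a * xi for a, xi in zip(A[j], point)) - b[j]
        return [r * a for a in A[j]]

    # Jacobian estimate: J[j] stores the last gradient evaluated for f_j.
    J = [grad_j(j, x) for j in range(n)]
    J_mean = [sum(J[j][i] for j in range(n)) / n for i in range(d)]

    for _ in range(iters):
        j = rng.randrange(n)  # uniform sampling (an assumption of this sketch)
        g_new = grad_j(j, x)
        # Unbiased, variance-reduced gradient estimate.
        g = [g_new[i] - J[j][i] + J_mean[i] for i in range(d)]
        x = [x[i] - step * g[i] for i in range(d)]
        # Refresh only the sampled column and the running mean of all columns.
        J_mean = [J_mean[i] + (g_new[i] - J[j][i]) / n for i in range(d)]
        J[j] = g_new
    return x, J


# Hypothetical toy data: a consistent system, so the minimizer is x_true.
A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
x_true = [2.0, -1.0]
b = [sum(a * t for a, t in zip(row, x_true)) for row in A]
L_max = max(sum(a * a for a in row) for row in A)  # largest smoothness constant of any f_j
x, J = saga_least_squares(A, b, step=1.0 / (3.0 * L_max), iters=2000)
```

Because the toy system is consistent, every $\nabla f_j$ vanishes at the minimizer, so the stored columns in `J` should shrink toward zero as `x` approaches `x_true`. This mirrors, in the simplest setting, the statement that the Jacobian estimate converges to the true Jacobian at the optimum.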

Paper Structure

This paper contains 69 sections, 27 theorems, 142 equations, 6 figures, 4 tables, 18 algorithms.

Key Result

Theorem 5.1

Let Assumption \ref{as:smooth_strongly_convex} hold. Let ${\cal B}$ be any linear operator commuting with ${\cal S}$, and assume ${{\cal M}^\dagger}^{1/2}$ commutes with ${\cal S}$. Let ${\cal R}$ be any linear operator for which ${\cal R}({\bf J}^k) = {\cal R}({\bf G}(x^*))$ for every $k\geq 0$. Define t where $\{x^k\}$ and $\{{\bf J}^k\}$ are the random iterates produced by Algorithm \ref{alg:SketchJac} with …

Figures (6)

  • Figure 1: Comparison of LSVRG & SAGA with importance and uniform sampling.
  • Figure 2: Comparison of SEGA-AS, SVRCD-AS, SEGA and proximal gradient on 4 quadratic problems given by Table \ref{tbl:quadratics}. SEGA-AS, SVRCD-AS and SEGA compute a single partial derivative each iteration (SVRCD computes all of them with probability $\rho$); SEGA-AS and SVRCD-AS sample with probabilities proportional to the diagonal of ${\bf M}$.
  • Figure 3: The effect of $\rho$ on the convergence rate of SVRCD on the quadratic problems from Table \ref{tbl:quadratics}. In every case, probabilities were chosen proportionally to the diagonal of ${\bf M}$ and only a single partial derivative is evaluated in ${\cal S}$.
  • Figure 4: ISAEGA applied on LIBSVM \cite{chang2011libsvm} datasets with $\lambda = 4\cdot 10^{-5}$. The $y$-axis shows relative suboptimality, i.e. $\frac{f(x^k)-f(x^*)}{f(x^0)-f(x^*)}$.
  • Figure 5: LSVRG applied on LIBSVM \cite{chang2011libsvm} datasets with $\lambda = 10^{-5}$. The $y$-axis shows relative suboptimality, i.e. $\frac{f(x^k)-f(x^*)}{f(x^0)-f(x^*)}$.
  • ...and 1 more figure
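
The captions for Figures 2 and 3 describe methods that evaluate a single partial derivative per iteration, sampled with probabilities proportional to the diagonal of ${\bf M}$. The following is a minimal sketch of that idea in the style of SEGA with importance sampling, for a quadratic $f(x) = \tfrac{1}{2} x^\top {\bf M} x - c^\top x$ without a regularizer. The function name `sega_quadratic`, the matrix, and the deliberately conservative step size (built from a Gershgorin bound on the largest eigenvalue) are assumptions made for illustration, not the settings used in the paper's experiments.

```python
import random


def sega_quadratic(M, c, iters, seed=0):
    """SEGA-style sketch for f(x) = 0.5 * x^T M x - c^T x, no regularizer (illustrative only)."""
    rng = random.Random(seed)
    d = len(M)
    diag = [M[i][i] for i in range(d)]
    p = [m / sum(diag) for m in diag]  # p_i proportional to M_ii
    # Assumed conservative step size: Gershgorin upper bound on lambda_max(M).
    L_bound = max(sum(abs(v) for v in row) for row in M)
    step = min(p) / (8.0 * L_bound)

    x = [0.0] * d
    h = [0.0] * d  # running estimate of the full gradient
    for _ in range(iters):
        i = rng.choices(range(d), weights=p)[0]
        # The only oracle call: one partial derivative of f at x.
        partial_i = sum(M[i][k] * x[k] for k in range(d)) - c[i]
        g = list(h)
        g[i] += (partial_i - h[i]) / p[i]  # unbiased: E[g] equals the full gradient at x
        h[i] = partial_i  # overwrite the sampled coordinate of the estimate
        x = [x[k] - step * g[k] for k in range(d)]
    return x, h


# Hypothetical symmetric positive definite matrix and a known minimizer x_star.
M = [[2.0, 0.5, 0.0], [0.5, 1.0, 0.2], [0.0, 0.2, 4.0]]
x_star = [1.0, -2.0, 0.5]
c = [sum(M[i][k] * x_star[k] for k in range(3)) for i in range(3)]
x, h = sega_quadratic(M, c, iters=50000)
```

Since no regularizer is present, the gradient at the minimizer is zero, so `h` should approach the zero vector while `x` approaches `x_star`. Dividing the correction by `p[i]` is what keeps the estimator `g` unbiased under non-uniform sampling.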

Theorems & Definitions (46)

  • Definition 3.1
  • Theorem 5.1
  • Remark 5.1
  • Lemma E.1
  • Proof
  • Lemma E.2: Lemma A.1 from \cite{hanzely2018sega}
  • Proof
  • Lemma E.3
  • Proof
  • Lemma E.4
  • ...and 36 more