One Method to Rule Them All: Variance Reduction for Data, Parameters and Many New Methods

Filip Hanzely; Peter Richtárik

TL;DR

The paper introduces Generalized Jacobian Sketching (GJS), a unifying variance-reduction framework for regularized empirical risk minimization (ERM) problems with many data points, a high model dimension, or both. GJS accesses gradient information through random sketches and maintains an estimate of the Jacobian that converges to the true Jacobian at the optimum. A single theorem establishes linear convergence under $\sigma$-quasi-strong convexity and $\{ {\bf M}_j\}$-smoothness. Known algorithms such as SAGA, SEGA, JacSketch, LSVRG and ISEGA, together with their arbitrary-sampling and proximal generalizations and many new variants, arise as special cases, so the framework covers both stochastic gradient and stochastic coordinate descent type methods. Experiments on LibSVM datasets illustrate the practical performance gains and parallel scalability.

Abstract

We propose a remarkably general variance-reduced method suitable for solving regularized empirical risk minimization problems with either a large number of training examples, or a large model dimension, or both. In special cases, our method reduces to several known and previously thought to be unrelated methods, such as {\tt SAGA}, {\tt LSVRG}, {\tt JacSketch}, {\tt SEGA} and {\tt ISEGA}, and their arbitrary sampling and proximal generalizations. However, we also highlight a large number of new specific algorithms with interesting properties. We provide a single theorem establishing linear convergence of the method under smoothness and quasi strong convexity assumptions. With this theorem we recover best-known and sometimes improved rates for known methods arising in special cases. As a by-product, we provide the first unified method and theory for stochastic gradient and stochastic coordinate descent type methods.
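
To make the idea of a stored Jacobian estimate concrete, the sketch below runs plain {\tt SAGA}, one of the special cases named above, on a small synthetic ridge-regularized least-squares problem. It is not the general method of the paper. The data, the ridge parameter `lam`, the step size `1/(3L)`, and all function and variable names are illustrative assumptions. The table `J` holds one stored gradient per data point, and the run reports how far `J` is from the true per-example gradients at the final iterate.

```python
import random


def saga_least_squares(n=20, d=3, lam=0.1, iters=50000, seed=0):
    """Plain SAGA on f(x) = (1/n) * sum_i f_i(x), with
    f_i(x) = 0.5 * (a_i . x - b_i)^2 + 0.5 * lam * ||x||^2.
    All problem data here is synthetic (an assumption for illustration)."""
    rng = random.Random(seed)
    A = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    b = [rng.gauss(0, 1) for _ in range(n)]

    def grad_i(x, i):
        # Gradient of the i-th loss term, including its ridge part.
        r = sum(A[i][j] * x[j] for j in range(d)) - b[i]
        return [r * A[i][j] + lam * x[j] for j in range(d)]

    # Smoothness constant of the worst f_i; 1/(3L) is a standard SAGA step size.
    L = max(sum(a * a for a in row) for row in A) + lam
    step = 1.0 / (3.0 * L)

    x = [0.0] * d
    # Jacobian estimate: one stored gradient per data point.
    J = [grad_i(x, i) for i in range(n)]
    avg = [sum(J[i][j] for i in range(n)) / n for j in range(d)]

    for _ in range(iters):
        i = rng.randrange(n)
        g_i = grad_i(x, i)
        # Variance-reduced, unbiased gradient estimator.
        g = [g_i[j] - J[i][j] + avg[j] for j in range(d)]
        x = [x[j] - step * g[j] for j in range(d)]
        # Refresh the stored gradient for example i and the running average.
        avg = [avg[j] + (g_i[j] - J[i][j]) / n for j in range(d)]
        J[i] = g_i

    full_grad = [sum(grad_i(x, i)[j] for i in range(n)) / n for j in range(d)]
    grad_norm = sum(v * v for v in full_grad) ** 0.5
    proxy_err = max(
        abs(J[i][j] - grad_i(x, i)[j]) for i in range(n) for j in range(d)
    )
    return x, grad_norm, proxy_err


if __name__ == "__main__":
    x, grad_norm, proxy_err = saga_least_squares()
    print("full gradient norm:", grad_norm)
    print("max error of stored gradients:", proxy_err)
```

In this toy run both printed quantities should be close to zero: the iterate approaches the minimizer while the stored gradients approach the true per-example gradients there, which is the behaviour the variance-reduction argument relies on.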
