SAPPHIRE: Preconditioned Stochastic Variance Reduction for Faster Large-Scale Statistical Learning

Jingruo Sun, Zachary Frangella, Madeleine Udell

TL;DR

SAPPHIRE tackles ill-conditioned, regularized empirical risk minimization at scale by marrying sketching-based preconditioning (SSN and NySSN) with variance-reduced gradients and a scaled proximal mapping for non-smooth penalties. The method achieves condition-number-free linear convergence under quadratic regularity and remains robust with infrequent preconditioner updates, while providing ergodic sublinear rates in broader convex settings and local linear convergence independent of conditioning. Theoretical results are complemented by extensive experiments on convex (e.g., Lasso, logistic with elastic-net) and non-convex (e.g., SCAD, MCP) penalties, showing up to 20x faster convergence than key baselines. The work offers a scalable, practical framework for large-scale statistical learning in domains with highly ill-conditioned data, such as genomics and advertising, by leveraging efficient preconditioning, variance reduction, and proximal updates.
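
The three ingredients named above (preconditioning, variance-reduced gradients, and a scaled proximal mapping) can be written in a standard form. This is a sketch under assumed notation: the regularizer $r$, preconditioner $P$, step size $\eta$, snapshot point $\hat{w}$, and mini-batch $B_k$ are illustrative symbols, and the paper's own notation and scaling may differ.

$$
\min_{w \in \mathbb{R}^p} \; \frac{1}{n}\sum_{i=1}^{n} \ell_i(w) + r(w)
$$

$$
g_k = \frac{1}{|B_k|}\sum_{i \in B_k}\bigl(\nabla \ell_i(w_k) - \nabla \ell_i(\hat{w})\bigr) + \frac{1}{n}\sum_{i=1}^{n}\nabla \ell_i(\hat{w})
$$

$$
w_{k+1} = \operatorname*{arg\,min}_{u \in \mathbb{R}^p} \Bigl\{ r(u) + \frac{1}{2\eta}\bigl\| u - (w_k - \eta P^{-1} g_k) \bigr\|_P^2 \Bigr\}, \qquad \|x\|_P^2 \coloneqq x^\top P x
$$

When $P$ is the identity, the last line reduces to the ordinary proximal SVRG step. A non-trivial $P$ changes both the gradient direction and the metric in which the proximal subproblem is solved.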

Abstract

Regularized empirical risk minimization (rERM) has become important in data-intensive fields such as genomics and advertising, with stochastic gradient methods typically used to solve the largest problems. However, ill-conditioned objectives and non-smooth regularizers undermine the performance of traditional stochastic gradient methods, leading to slow convergence and significant computational costs. To address these challenges, we propose the $\texttt{SAPPHIRE}$ ($\textbf{S}$ketching-based $\textbf{A}$pproximations for $\textbf{P}$roximal $\textbf{P}$reconditioning and $\textbf{H}$essian $\textbf{I}$nexactness with Variance-$\textbf{RE}$duced Gradients) algorithm, which integrates sketch-based preconditioning to tackle ill-conditioning and uses a scaled proximal mapping to minimize the non-smooth regularizer. This stochastic variance-reduced algorithm achieves condition-number-free linear convergence to the optimum, delivering an efficient and scalable solution for ill-conditioned composite large-scale convex machine learning problems. Extensive experiments on lasso and logistic regression demonstrate that $\texttt{SAPPHIRE}$ often converges $20$ times faster than other common choices such as $\texttt{Catalyst}$, $\texttt{SAGA}$, and $\texttt{SVRG}$. This advantage persists even when the objective is non-convex or the preconditioner is infrequently updated, highlighting its robust and practical effectiveness.
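
To make the combination described in the abstract concrete, here is a minimal Python sketch of a preconditioned, variance-reduced proximal loop for the lasso. It is not the authors' implementation. As a loudly stated simplification, it replaces the sketching-based preconditioner with a diagonal one (the Hessian diagonal plus a damping term `rho`), because a diagonal metric gives the scaled proximal mapping of the $L_1$ penalty a closed form. All function names, parameters, and default values are hypothetical.

```python
import numpy as np


def scaled_soft_threshold(v, thresh):
    """Prox of an L1 penalty in a diagonal metric: per-coordinate soft-thresholding."""
    return np.sign(v) * np.maximum(np.abs(v) - thresh, 0.0)


def lasso_objective(A, b, w, lam):
    """(1/2n) * ||A w - b||^2 + lam * ||w||_1."""
    r = A @ w - b
    return 0.5 * np.mean(r ** 2) + lam * np.sum(np.abs(w))


def preconditioned_prox_svrg(A, b, lam, eta=0.1, epochs=30, batch=8,
                             rho=1e-3, seed=0):
    """SVRG-style loop with a diagonal preconditioner and a scaled prox step.

    ASSUMPTION: the diagonal preconditioner below is a stand-in chosen for
    simplicity; it is not the sketching-based preconditioner of the paper.
    """
    rng = np.random.default_rng(seed)
    n, p = A.shape
    # Diagonal of the least-squares Hessian (1/n) A^T A, damped by rho.
    d = np.mean(A ** 2, axis=0) + rho
    w = np.zeros(p)
    for _ in range(epochs):
        # Snapshot point and its full gradient (the variance-reduction anchor).
        w_snap = w.copy()
        full_grad = A.T @ (A @ w_snap - b) / n
        for _ in range(n // batch):
            idx = rng.integers(0, n, size=batch)
            Ab, bb = A[idx], b[idx]
            g = Ab.T @ (Ab @ w - bb) / batch
            g_snap = Ab.T @ (Ab @ w_snap - bb) / batch
            v = g - g_snap + full_grad  # variance-reduced gradient estimate
            # Preconditioned step, then the prox in the metric diag(d):
            # argmin_u lam*||u||_1 + (1/(2*eta)) * sum_j d_j * (u_j - z_j)^2
            z = w - eta * v / d
            w = scaled_soft_threshold(z, eta * lam / d)
    return w
```

With a diagonal metric, the scaled proximal subproblem separates by coordinate, so each coordinate is soft-thresholded at `eta * lam / d[j]`. A dense preconditioner would generally require an inner solver for this step instead.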

Paper Structure

This paper contains 55 sections, 19 theorems, 105 equations, 7 figures, 4 tables, 3 algorithms.

Key Result

Lemma 3.1

For any $\rho \geq 0$ and $w\in \mathbb{R}^p$, the following inequalities hold, where $M(w) \coloneqq \lambda_{\textup{max}}(\nabla^2 \ell_i(w))$.

Figures (7)

  • Figure 1: SAPPHIRE significantly outperforms competing stochastic optimizers on a large-scale click prediction problem with the avazu dataset $(n=12,642,186, \ p=999,990)$.
  • Figure 2: $L_1$-logistic regression: SAPPHIRE vs. competing methods on rcv1 and covtype
  • Figure 3: Lasso: SAPPHIRE vs. competing methods on rna-seq and yearmsd
  • Figure 4: Logistic regression: SAPPHIRE vs. competing methods on avazu and url
  • Figure 5: Least-squares regression with SCAD regularization
  • ...and 2 more figures

Theorems & Definitions (33)

  • Definition 1
  • Remark 3.1
  • Lemma 3.1
  • Lemma 3.2
  • Lemma 3.3
  • Definition 2: Quadratic Regularity
  • Remark 4.1
  • Lemma 4.1: Smoothness and strong-convexity imply quadratic regularity
  • Definition 3: $\rho$-weak quadratic regularity
  • Lemma 4.2: Smoothness and convexity imply $\rho$-weak quadratic regularity
  • ...and 23 more