SketchySGD: Reliable Stochastic Optimization via Randomized Curvature Estimates

Zachary Frangella; Pratik Rathore; Shipu Zhao; Madeleine Udell

TL;DR

SketchySGD introduces a practical stochastic quasi-Newton method that leverages randomized curvature estimates via a Nyström-based low-rank Hessian sketch to precondition SGD. It provides an automated learning-rate rule that adapts to preconditioned curvature, and it updates the preconditioner infrequently, making it scalable to very large datasets. Theoretical results establish convergence for smooth convex and strongly convex objectives, with improved iteration complexity in ill-conditioned or quadratic settings, and empirical evidence shows competitive or superior performance against first-order, quasi-Newton, and PCG methods across ridge, logistic, and large-scale deep-learning tasks. The combination of out-of-the-box defaults, robust conditioning improvements, and practical efficiency positions SketchySGD as a strong drop-in alternative for convex learning problems with challenging conditioning.

Abstract

SketchySGD improves upon existing stochastic gradient methods in machine learning by using randomized low-rank approximations to the subsampled Hessian and by introducing an automated stepsize that works well across a wide range of convex machine learning problems. We show theoretically that SketchySGD with a fixed stepsize converges linearly to a small ball around the optimum. Further, in the ill-conditioned setting we show SketchySGD converges at a faster rate than SGD for least-squares problems. We validate this improvement empirically with ridge regression experiments on real data. Numerical experiments on both ridge and logistic regression problems with dense and sparse data show that SketchySGD equipped with its default hyperparameters can achieve comparable or better results than popular stochastic gradient methods, even when they have been tuned to yield their best performance. In particular, SketchySGD is able to solve an ill-conditioned logistic regression problem with a data matrix that takes more than $840$ GB of RAM to store, while its competitors, even when tuned, are unable to make any progress. SketchySGD's ability to work out of the box with its default hyperparameters and excel on ill-conditioned problems is an advantage over other stochastic gradient methods, most of which require careful hyperparameter tuning (especially of the learning rate) to obtain good performance and degrade in the presence of ill-conditioning.
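To make the idea in the abstract concrete, the following is a minimal sketch of SGD preconditioned by a randomized low-rank (Nyström) approximation of a subsampled Hessian, applied to ridge regression. It is an illustration under stated assumptions, not the authors' implementation: the function names (`nystrom_approx`, `precond_solve`, `sketchy_sgd_ridge`, `make_problem`), the synthetic data, and all hyperparameter values (`rank`, `rho`, batch sizes, `update_freq`) are hypothetical. The stepsize used here, half the reciprocal of an estimated largest eigenvalue of the preconditioned subsampled Hessian, is one plausible reading of a stepsize that adapts to preconditioned curvature and should not be taken as the paper's exact rule.

```python
import numpy as np


def nystrom_approx(hvp, dim, rank, rng):
    """Randomized Nystrom approximation U diag(lam) U^T of a PSD operator `hvp`."""
    omega, _ = np.linalg.qr(rng.standard_normal((dim, rank)))  # orthonormal test matrix
    Y = hvp(omega)                                    # sketch Y = H @ omega
    shift = np.finfo(float).eps * np.linalg.norm(Y)   # tiny shift for numerical stability
    Y = Y + shift * omega
    C = np.linalg.cholesky(omega.T @ Y)               # omega^T Y = C C^T
    B = np.linalg.solve(C, Y.T).T                     # B = Y C^{-T}, so B B^T approximates H
    U, sigma, _ = np.linalg.svd(B, full_matrices=False)
    lam = np.maximum(sigma**2 - shift, 0.0)           # remove the shift again
    return U, lam


def precond_solve(U, lam, rho, g):
    """Apply (U diag(lam) U^T + rho I)^{-1} to g without forming a dim x dim matrix."""
    Utg = U.T @ g
    return U @ (Utg / (lam + rho)) + (g - U @ Utg) / rho


def ridge_loss(X, y, w, mu):
    r = X @ w - y
    return 0.5 * np.mean(r**2) + 0.5 * mu * (w @ w)


def sketchy_sgd_ridge(X, y, mu=1e-4, rank=5, rho=1e-2, grad_batch=256,
                      hess_batch=512, update_freq=100, steps=500, seed=0):
    """Preconditioned SGD for ridge regression (all defaults are illustrative)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = np.zeros(p)
    for t in range(steps):
        if t % update_freq == 0:  # refresh the preconditioner only occasionally
            Xs = X[rng.choice(n, size=hess_batch, replace=False)]
            hvp = lambda V: Xs.T @ (Xs @ V) / hess_batch + mu * V  # subsampled Hessian product
            U, lam = nystrom_approx(hvp, p, rank, rng)
            # Power iteration on P^{-1} H_S to estimate its largest eigenvalue.
            v = rng.standard_normal(p)
            for _ in range(30):
                v = precond_solve(U, lam, rho, hvp(v))
                v /= np.linalg.norm(v)
            Pv = U @ (lam * (U.T @ v)) + rho * v
            top = (v @ hvp(v)) / (v @ Pv)   # generalized Rayleigh quotient
            eta = 0.5 / top                 # assumed stepsize rule, see lead-in
        idx = rng.choice(n, size=grad_batch, replace=False)
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / grad_batch + mu * w
        w = w - eta * precond_solve(U, lam, rho, grad)
    return w


def make_problem(n=4000, p=20, seed=0):
    """Synthetic ill-conditioned least-squares data (hypothetical test problem)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p)) * np.logspace(0, -3, p)  # column scales span 3 decades
    y = X @ np.ones(p) + 0.01 * rng.standard_normal(n)
    return X, y


if __name__ == "__main__":
    X, y = make_problem()
    w = sketchy_sgd_ridge(X, y)
    print("initial loss:", ridge_loss(X, y, np.zeros(X.shape[1]), 1e-4))
    print("final loss:  ", ridge_loss(X, y, w, 1e-4))
```

The preconditioner is stored only through the factors `U` and `lam`, so each application costs on the order of `dim * rank` operations, and refreshing it every `update_freq` steps keeps the Hessian work small relative to the gradient steps.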

Paper Structure (82 sections, 18 theorems, 112 equations, 43 figures, 8 tables, 6 algorithms)

Introduction
SketchySGD
Contributions
Roadmap
Notation
SketchySGD: efficient implementation and hyperparameter selection
Hessian vector product oracle
Randomized low-rank approximation
Setting the learning rate
Computing the SketchySGD update \eqref{eq:SketchySGDIter} fast
Default parameters for \ref{alg:SketchySGD}
Comparison to previous work
Stochastic quasi-Newton methods for convex optimization
Stochastic quasi-Newton methods for non-convex optimization
Theory
...and 67 more sections

Key Result

Lemma 4.4

Let $h:\mathcal{C}\rightarrow \mathbb{R}$, where $\mathcal{C}$ is a closed convex subset of $\mathbb{R}^p$. Then the following items hold:

Figures (43)

Figure 1: SketchySGD outperforms standard stochastic gradient optimizers, even when their parameters are tuned for optimal performance. Each optimizer was allowed 40 full data passes.
Figure 1: Comparisons to first-order methods with default learning rates (SVRG, SAGA) and smoothness parameters (L-Katyusha) on $l_2$-regularized logistic regression.
Figure 1: Sensitivity of SketchySGD to rank $r$.
Figure 1: Spectrum of the Hessian at epochs $0,10,20,30$ before and after preconditioning in $l_2$-regularized logistic regression.
Figure 1: Comparisons to quasi-Newton methods (L-BFGS, SLBFGS, RSN, Newton Sketch) on $l_2$-regularized logistic regression with augmented datasets.
...and 38 more figures

Theorems & Definitions (35)

Remark 2.1
Definition 4.3: Relative quadratic regularity
Lemma 4.4: Smoothness and strong convexity imply quadratic regularity
Definition 4.5: $\rho$-dissimilarity
Lemma 4.6: $\rho$-dissimilarity never exceeds $n$
Proposition 4.7: $\rho$-dissimilarity is small for GLMs in the machine learning setting
Lemma 4.8: Closeness in Loewner ordering between $H^\rho_S(w)$ and $H^\rho(w)$
Proposition 4.9: Closeness in Loewner ordering between $H(w)$ and $\hat{H}_{S}^{\rho}$
Proposition 4.10: Preconditioned expected smoothness and gradient variance
Corollary 4.12: Union bound
...and 25 more
