SketchySGD: Reliable Stochastic Optimization via Randomized Curvature Estimates

Zachary Frangella, Pratik Rathore, Shipu Zhao, Madeleine Udell

TL;DR

SketchySGD introduces a practical stochastic quasi-Newton method that leverages randomized curvature estimates via a Nyström-based low-rank Hessian sketch to precondition SGD. It provides an automated learning-rate rule that adapts to preconditioned curvature, and it updates the preconditioner infrequently, making it scalable to very large datasets. Theoretical results establish convergence for smooth convex and strongly convex objectives, with improved iteration complexity in ill-conditioned or quadratic settings, and empirical evidence shows competitive or superior performance against first-order, quasi-Newton, and PCG methods across ridge, logistic, and large-scale deep-learning tasks. The combination of out-of-the-box defaults, robust conditioning improvements, and practical efficiency positions SketchySGD as a strong drop-in alternative for convex learning problems with challenging conditioning.
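
As a rough sketch of the update this summary describes (the notation is an assumption chosen for illustration, not quoted from the paper): let $\hat{H} = \hat{V}\hat{\Lambda}\hat{V}^{\top}$ be a rank-$r$ Nyström approximation of a subsampled Hessian with orthonormal $\hat{V} \in \mathbb{R}^{p \times r}$, let $\rho > 0$ be a regularization parameter, and let $g_k$ be a minibatch gradient at $w_k$. A preconditioned SGD step with stepsize $\eta$ then takes the form

$$w_{k+1} = w_k - \eta\,(\hat{H} + \rho I)^{-1} g_k, \qquad (\hat{H} + \rho I)^{-1} = \hat{V}(\hat{\Lambda} + \rho I)^{-1}\hat{V}^{\top} + \frac{1}{\rho}\left(I - \hat{V}\hat{V}^{\top}\right).$$

The second identity follows from the orthonormality of $\hat{V}$, so applying the preconditioner to a vector costs $O(pr)$ operations and never requires forming a $p \times p$ matrix.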

Abstract

SketchySGD improves upon existing stochastic gradient methods in machine learning by using randomized low-rank approximations to the subsampled Hessian and by introducing an automated stepsize that works well across a wide range of convex machine learning problems. We show theoretically that SketchySGD with a fixed stepsize converges linearly to a small ball around the optimum. Further, in the ill-conditioned setting we show SketchySGD converges at a faster rate than SGD for least-squares problems. We validate this improvement empirically with ridge regression experiments on real data. Numerical experiments on both ridge and logistic regression problems with dense and sparse data show that SketchySGD equipped with its default hyperparameters can achieve comparable or better results than popular stochastic gradient methods, even when they have been tuned to yield their best performance. In particular, SketchySGD is able to solve an ill-conditioned logistic regression problem with a data matrix that takes more than $840$ GB of RAM to store, while its competitors, even when tuned, are unable to make any progress. SketchySGD's ability to work out of the box with its default hyperparameters and excel on ill-conditioned problems is an advantage over other stochastic gradient methods, most of which require careful hyperparameter tuning (especially of the learning rate) to obtain good performance and degrade in the presence of ill-conditioning.
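
To make the ingredients named in the abstract concrete, here is a minimal NumPy sketch of a Nyström-preconditioned SGD loop on a synthetic ridge regression problem. It is an illustration under stated assumptions, not the authors' implementation: the function names, the default values (`rank`, `rho`, `batch`, `hess_batch`, `update_freq`), the synthetic data in `demo`, and the stepsize rule (one half divided by a power-iteration estimate of the largest eigenvalue of the preconditioned subsampled Hessian) are all choices made here. For brevity the same Hessian subsample is reused for the sketch and for the stepsize estimate.

```python
import numpy as np

def nystrom_approx(matmat, p, rank, rng):
    """Randomized Nystrom approximation of a PSD matrix H given only Q -> H @ Q.

    Returns (V, lam) with H ~= V @ diag(lam) @ V.T and V having orthonormal columns.
    """
    omega, _ = np.linalg.qr(rng.standard_normal((p, rank)))  # orthonormal test matrix
    y = matmat(omega)                                        # sketch Y = H @ Omega
    shift = np.sqrt(p) * np.finfo(y.dtype).eps * np.linalg.norm(y, 2)
    y_shift = y + shift * omega                              # small shift for stability
    c = np.linalg.cholesky(omega.T @ y_shift)                # Omega^T Y_shift = C C^T
    b = np.linalg.solve(c, y_shift.T).T                      # B = Y_shift C^{-T}
    v, s, _ = np.linalg.svd(b, full_matrices=False)
    lam = np.maximum(s**2 - shift, 0.0)                      # undo the shift
    return v, lam

def apply_inv_precond(v, lam, rho, g):
    """Compute (V diag(lam) V^T + rho I)^{-1} g using the low-rank structure."""
    vtg = v.T @ g
    return v @ (vtg / (lam + rho)) + (g - v @ vtg) / rho

def ridge_loss(X, y, w, mu):
    r = X @ w - y
    return 0.5 * np.mean(r**2) + 0.5 * mu * (w @ w)

def sketchy_sgd_ridge(X, y, mu, rank=10, rho=1e-2, batch=256, hess_batch=512,
                      iters=500, update_freq=250, seed=0):
    """Preconditioned SGD for ridge regression (all defaults are assumptions)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = np.zeros(p)
    for k in range(iters):
        if k % update_freq == 0:
            # Infrequent refresh: sketch a subsampled Hessian, then pick a stepsize.
            Xs = X[rng.choice(n, size=hess_batch, replace=False)]
            hess = lambda Q: Xs.T @ (Xs @ Q) / hess_batch + mu * Q
            v, lam = nystrom_approx(hess, p, rank, rng)
            # Power iteration estimates the largest eigenvalue of P^{-1} H_S.
            z = rng.standard_normal(p)
            for _ in range(50):
                z /= np.linalg.norm(z)
                z = apply_inv_precond(v, lam, rho, hess(z))
            eta = 0.5 / np.linalg.norm(z)
        idx = rng.choice(n, size=batch, replace=False)
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / batch + mu * w
        w -= eta * apply_inv_precond(v, lam, rho, grad)
    return w

def demo(seed=0):
    """Ill-conditioned synthetic problem: column scales span two orders of magnitude."""
    rng = np.random.default_rng(seed)
    n, p, mu = 4000, 20, 1e-3
    X = rng.standard_normal((n, p)) * np.logspace(0, -2, p)
    y = X @ np.ones(p) + 0.1 * rng.standard_normal(n)
    w = sketchy_sgd_ridge(X, y, mu, seed=seed)
    return ridge_loss(X, y, np.zeros(p), mu), ridge_loss(X, y, w, mu)
```

Calling `demo()` returns the objective at the zero initial point and at the final iterate, which gives a quick check that the preconditioned iteration makes progress on this toy problem. The preconditioner is stored only through `v` and `lam`, so each step costs a minibatch gradient plus two thin matrix-vector products.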

Paper Structure

This paper contains 82 sections, 18 theorems, 112 equations, 43 figures, 8 tables, 6 algorithms.

Key Result

Lemma 4.4

Let $h:\mathcal{C}\rightarrow \mathbb{R}$, where $\mathcal{C}$ is a closed convex subset of $\mathbb{R}^p$. Then the following items hold.

Figures (43)

  • Figure 1: SketchySGD outperforms standard stochastic gradient optimizers, even when their parameters are tuned for optimal performance. Each optimizer was allowed 40 full data passes.
  • Figure 1: Comparisons to first-order methods with default learning rates (SVRG, SAGA) and smoothness parameters (L-Katyusha) on $l_2$-regularized logistic regression.
  • Figure 1: Sensitivity of SketchySGD to rank $r$.
  • Figure 1: Spectrum of the Hessian at epochs $0,10,20,30$ before and after preconditioning in $l_2$-regularized logistic regression.
  • Figure 1: Comparisons to quasi-Newton methods (L-BFGS, SLBFGS, RSN, Newton Sketch) on $l_2$-regularized logistic regression with augmented datasets.
  • ...and 38 more figures

Theorems & Definitions (35)

  • Remark 2.1
  • Definition 4.3: Relative quadratic regularity
  • Lemma 4.4: Smoothness and strong convexity imply quadratic regularity
  • Definition 4.5: $\rho$-dissimilarity
  • Lemma 4.6: $\rho$-dissimilarity never exceeds $n$
  • Proposition 4.7: $\rho$-dissimilarity is small for GLMs in the machine learning setting
  • Lemma 4.8: Closeness in Loewner ordering between $H^\rho_S(w)$ and $H^\rho(w)$
  • Proposition 4.9: Closeness in Loewner ordering between $H(w)$ and $\hat{H}_{S}^{\rho}$
  • Proposition 4.10: Preconditioned expected smoothness and gradient variance
  • Corollary 4.12: Union bound
  • ...and 25 more