Stochastic Variance-Reduced Newton: Accelerating Finite-Sum Minimization with Large Batches

Michał Dereziński

TL;DR

This work addresses accelerating finite-sum convex minimization by integrating variance reduction with stochastic Newton updates. The authors prove that Stochastic Variance-Reduced Newton (SVRN) achieves a faster sequential convergence rate, reducing data passes to $O\left(\frac{\log(1/\varepsilon)}{\log(n)}\right)$ while maintaining a low parallel cost through Hessian-based updates, with the improvement scaling favorably as the dataset size $n$ grows. They also introduce SVRN-HA, a globally convergent variant that combines Hessian averaging and a line-search phase with a subsequent SVRN phase, guaranteeing practical convergence from arbitrary initial points. Empirical results on logistic regression and least-squares problems demonstrate substantial speedups in data passes and competitive wall-clock performance, highlighting SVRN's practicality for large-scale second-order variance-reduced optimization.

Abstract

Stochastic variance reduction has proven effective at accelerating first-order algorithms for solving convex finite-sum optimization tasks such as empirical risk minimization. Incorporating second-order information has proven helpful in further improving the performance of these first-order methods. Yet, comparatively little is known about the benefits of using variance reduction to accelerate popular stochastic second-order methods such as Subsampled Newton. To address this, we propose Stochastic Variance-Reduced Newton (SVRN), a finite-sum minimization algorithm that provably accelerates existing stochastic Newton methods from $O(\alpha\log(1/\varepsilon))$ to $O\big(\frac{\log(1/\varepsilon)}{\log(n)}\big)$ passes over the data, i.e., by a factor of $O(\alpha\log(n))$, where $n$ is the number of sum components and $\alpha$ is the approximation factor in the Hessian estimate. Surprisingly, this acceleration gets more significant the larger the data size $n$, which is a unique property of SVRN. Our algorithm retains the key advantages of Newton-type methods, such as easily parallelizable large-batch operations and a simple unit step size. We use SVRN to accelerate Subsampled Newton and Iterative Hessian Sketch algorithms, and show that it compares favorably to popular first-order methods with variance reduction.
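To make the idea in the abstract concrete, the following is a minimal sketch of a variance-reduced Newton loop on a least-squares objective $f(\mathbf x)=\frac{1}{2n}\|A\mathbf x-\mathbf b\|^2$. It is not the paper's reference implementation: the function name `svrn_least_squares`, the uniform row subsampling used for the Hessian estimate, the choice to refresh that estimate once per outer iteration, and all parameter values are assumptions made for illustration. It does follow the ingredients the abstract names: an exact gradient at an anchor point, large mini-batch gradient corrections, a Hessian estimate, and a unit step size.

```python
import numpy as np

def svrn_least_squares(A, b, x0, hess_sample, batch, outer_iters, seed=0):
    """Illustrative variance-reduced Newton loop for f(x) = ||Ax - b||^2 / (2n)."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    x_anchor = np.asarray(x0, dtype=float).copy()
    inner_iters = max(1, n // batch)  # roughly one pass over the data per outer loop

    for _ in range(outer_iters):
        # Hessian estimate from a uniform row subsample (an assumption of this sketch).
        rows = rng.choice(n, size=hess_sample, replace=False)
        H_est = A[rows].T @ A[rows] / hess_sample

        # Exact gradient at the anchor point: one full pass over the data.
        g_anchor = A.T @ (A @ x_anchor - b) / n

        x = x_anchor.copy()
        for _ in range(inner_iters):
            idx = rng.choice(n, size=batch, replace=False)
            A_b, b_b = A[idx], b[idx]
            # Mini-batch gradients at the current iterate and at the anchor.
            g_x = A_b.T @ (A_b @ x - b_b) / batch
            g_a = A_b.T @ (A_b @ x_anchor - b_b) / batch
            # Variance-reduced gradient estimate, then a unit-step Newton-type update.
            v = g_x - g_a + g_anchor
            x = x - np.linalg.solve(H_est, v)
        x_anchor = x
    return x_anchor

# Usage on synthetic data (sizes chosen only for illustration).
rng = np.random.default_rng(1)
A = rng.standard_normal((20000, 10))
b = A @ rng.standard_normal(10) + 0.1 * rng.standard_normal(20000)
x_hat = svrn_least_squares(A, b, np.zeros(10), hess_sample=2000, batch=2000, outer_iters=5)
x_ls = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.linalg.norm(x_hat - x_ls))
```

Because the mini-batch gradients at the iterate and at the anchor use the same sample, their difference has low variance once the iterate is close to the anchor, which is what lets the inner loop keep a unit step size instead of a tuned, decaying one.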

Paper Structure (31 sections, 12 theorems, 56 equations, 5 figures, 1 table, 1 algorithm)

Key Result

Theorem 1

Suppose that Assumption a:convex holds and that either (a) $f$ has a Lipschitz Hessian, or (b) $f$ is self-concordant. Moreover, let $n\gg\kappa\gg\alpha$. Then there is an algorithm (SVRN) and an open neighborhood $U$ such that, given any $\mathbf x\in U$ with a corresponding Hessian $\alpha$-approximation as in e…

Figures (5)

  • Figure 1: Illustration of the local convergence complexity analysis for SVRN, as a function of the mini-batch size $m$, with the number of inner iterations set to $t_{\max}=n/m$. As we decrease the mini-batch size from $n$ (standard Stochastic Newton; SN) down to $m\approx\frac{n}{\alpha\log(n/\kappa)}$ (optimal SVRN), the sequential complexity (number of passes over the data) improves by $O(\alpha\log(n))$, while the parallel complexity (number of batch gradient queries) remains optimal.
  • Figure 2: Convergence and runtime comparison of SVRN-HA against three baselines: classical Newton, SVRG (after parameter tuning), and Subsampled Newton with Hessian Averaging (SN-HA), i.e., the initial phase of SVRN-HA run without variance reduction all the way through.
  • Figure 3: How different types of gradient estimation affect the convergence properties of SVRN.
  • Figure 4: How Hessian sample size and data coherence affect the convergence properties of SVRN.
  • Figure 5: Convergence comparison of SVRN-HA against SN-HA and Newton for a synthetic logistic regression task as we vary the condition number of the data matrix, and for the CIFAR-10 dataset.
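The mini-batch scaling quoted in the Figure 1 caption can be turned into a small calculator. The sketch below is only an illustration: the helper name `svrn_batch_schedule` is hypothetical, the constants hidden by the $O(\cdot)$ and $\approx$ notation are set to 1, and the natural logarithm is assumed, so the outputs should be read as rough orders of magnitude rather than tuned settings.

```python
import math

def svrn_batch_schedule(n, alpha, kappa):
    """Rough mini-batch size, inner-loop length, and speedup factor from Figure 1.

    Uses m ~ n / (alpha * log(n / kappa)) and t_max = n / m, with all hidden
    constants set to 1 (an assumption of this sketch).
    """
    if not (n > kappa > 0 and alpha > 0):
        raise ValueError("expected n > kappa > 0 and alpha > 0")
    log_factor = alpha * math.log(n / kappa)
    # Clamp so that the batch never exceeds the dataset or drops below one sample.
    m = max(1, min(n, int(round(n / max(log_factor, 1.0)))))
    t_max = math.ceil(n / m)           # inner iterations covering about one data pass
    speedup = alpha * math.log(n)      # order of the improvement in data passes
    return m, t_max, speedup

# Example with illustrative values: one million components, alpha = 2, kappa = 100.
m, t_max, speedup = svrn_batch_schedule(10**6, 2.0, 100.0)
print(m, t_max, round(speedup, 1))
```

The mini-batch stays a large fraction of the dataset, smaller than $n$ only by a logarithmic factor, which is why each inner step remains a large-batch, easily parallelized operation.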

Theorems & Definitions (16)

  • Theorem 1: informal version of Theorem t:svrn
  • Remark 1
  • Remark 2
  • Theorem 2: Fast least squares solver
  • Definition 1
  • Theorem 3: Convergence rate of SVRN
  • Remark 3
  • Theorem 4
  • Lemma 1
  • Lemma 2
  • ...and 6 more