Newton Meets Marchenko-Pastur: Massively Parallel Second-Order Optimization with Hessian Sketching and Debiasing

Elad Romanov, Fangzhao Zhang, Mert Pilanci

TL;DR

This work proposes a scheme where the central node (server) effectively runs a Newton method, offloading its high per-iteration cost -- stemming from the need to invert the Hessian -- to the workers.

Abstract

Motivated by recent advances in serverless cloud computing, in particular the "function as a service" (FaaS) model, we consider the problem of minimizing a convex function in a massively parallel fashion, where communication between workers is limited. Focusing on the case of a twice-differentiable objective subject to an L2 penalty, we propose a scheme where the central node (server) effectively runs a Newton method, offloading its high per-iteration cost -- stemming from the need to invert the Hessian -- to the workers. In our solution, workers independently produce coarse but low-bias estimates of the inverse Hessian, using an adaptive sketching scheme. The server then averages the descent directions produced by the workers, yielding a good approximation of the exact Newton step. The main component of our adaptive sketching scheme is a low-complexity procedure for selecting the sketching dimension, an issue that was left largely unaddressed in the existing literature on Hessian sketching for distributed optimization. Our solution is based on ideas from asymptotic random matrix theory, specifically the Marchenko-Pastur law. For Gaussian sketching matrices, we derive non-asymptotic guarantees for our algorithm which are essentially dimension-free. Lastly, when the objective is self-concordant, we provide convergence guarantees for the approximate Newton method with noisy Hessians, which may be of independent interest beyond the setting considered in this paper.
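The worker/server split described in the abstract can be illustrated with a minimal sketch. It is not the paper's Algorithm 1: it assumes a ridge-regularized least-squares objective, uses a fixed sketching dimension `m` rather than the adaptive selection procedure, and applies no bias correction. The function names `worker_direction` and `averaged_newton_step`, and all problem sizes, are hypothetical choices made for illustration only.

```python
import numpy as np


def worker_direction(A, grad, lam, m, rng):
    """One worker: Gaussian-sketch the data, form a coarse Hessian, solve for a direction.

    Assumed objective: f(x) = 0.5 * ||A x - b||^2 + 0.5 * lam * ||x||^2,
    whose exact Hessian is A^T A + lam * I.
    """
    n, d = A.shape
    S = rng.standard_normal((m, n)) / np.sqrt(m)  # Gaussian sketch with E[S^T S] = I
    SA = S @ A                                    # sketched data, shape (m, d)
    H_sketch = SA.T @ SA + lam * np.eye(d)        # coarse Hessian estimate
    return np.linalg.solve(H_sketch, grad)        # worker's descent direction


def averaged_newton_step(A, b, x, lam, m, num_workers, rng):
    """Server: average the workers' directions and take an approximate Newton step."""
    grad = A.T @ (A @ x - b) + lam * x
    # Workers are simulated sequentially here; in practice they would run in parallel.
    directions = [worker_direction(A, grad, lam, m, rng) for _ in range(num_workers)]
    return x - np.mean(directions, axis=0)


# Illustrative synthetic problem (sizes are arbitrary assumptions).
rng = np.random.default_rng(0)
n, d, lam = 2000, 20, 1.0
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

x = np.zeros(d)
for _ in range(10):
    x = averaged_newton_step(A, b, x, lam, m=400, num_workers=20, rng=rng)

# Compare against the exact minimizer of the ridge objective.
x_star = np.linalg.solve(A.T @ A + lam * np.eye(d), A.T @ b)
rel_err = np.linalg.norm(x - x_star) / np.linalg.norm(x_star)
print(f"relative error after 10 averaged sketched Newton steps: {rel_err:.2e}")
```

Averaging reduces the variance of the workers' directions but not their bias. In the unregularized Gaussian case, the inverse of the sketched Hessian overestimates the true inverse by the known inverse-Wishart factor $m/(m-d-1)$, which no amount of averaging removes. This is the kind of systematic error that the bias correction described in the abstract targets.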

Paper Structure (40 sections, 36 theorems, 172 equations, 4 figures, 2 tables, 1 algorithm)


Key Result

Theorem 1

Set $\eta,D>0$. Assume [Asymp($\tau,H,\xi$)]. W.p. $1-O(m^{-D})$, simultaneously for all $-1/\tau \le z \le -\tau$, The constants in the $O(\cdot)$ notation depend on $\tau,\eta,D$.

Figures (4)

  • Figure 1: The bias proxy of the bias-corrected inverse Hessian estimator is substantially lower than without bias correction. Rightmost plot: ensemble (R); left and middle plots: ensemble (L). Leftmost plot: the sketching dimension found by Algorithm 1 on ensemble (L). Shaded area: 20%-80% confidence interval; we take $T=10$ Monte-Carlo trials.
  • Figure 2: Improved convergence of our parallel sketched Newton method with bias correction. The title of each sub-figure corresponds to the dataset used. Over $T=10$ Monte-Carlo trials, the curve shows the median and the shaded region a $20\%$-$80\%$ confidence interval.
  • Figure 3: Convergence of the parallel Newton method with Hessian sketching, on synthetic optimization tasks.
  • Figure 4: Convergence rate for optimization tasks on additional UCI data sets. Shaded area corresponds to 20%-80% confidence interval.

Theorems & Definitions (71)

  • Remark 1
  • Remark 2
  • Definition 1
  • Theorem 1
  • proof
  • Theorem 2
  • proof
  • Theorem 3
  • proof
  • Lemma 1
  • ...and 61 more