Optimal Shrinkage for Distributed Second-Order Optimization

Fangzhao Zhang, Mert Pilanci

TL;DR

We introduce a novel shrinkage-based, asymptotically unbiased estimator for the resolvent of Gram matrices and characterize its non-asymptotic convergence rate in the isotropic case.

Abstract

In this work, we address the problem of Hessian inversion bias in distributed second-order optimization algorithms. We introduce a novel shrinkage-based estimator for the resolvent of Gram matrices which is asymptotically unbiased, and characterize its non-asymptotic convergence rate in the isotropic case. We apply this estimator to bias correction of Newton steps in distributed second-order optimization algorithms, as well as to randomized sketching-based methods. We examine the bias present in the naive averaging-based distributed Newton's method using analytical expressions and contrast it with our proposed bias-free approach. Our approach leads to significant improvements in convergence rate compared to standard baselines and recent proposals, as shown through experiments on both real and synthetic datasets.

Paper Structure

This paper contains 43 sections, 19 theorems, 88 equations, 15 figures, 2 tables, 2 algorithms.

Key Result

Theorem 1

(Informal; see the formal statement of the theorem for the assumptions.) For a random data matrix $A\in \mathbb{R}^{n\times d}$, let $d_\lambda$ denote the effective dimension of the true covariance matrix $\Sigma_n$, and let $\hat{\Sigma}_n$ denote the empirical covariance matrix. Then, we have [displayed equation missing from the source], where $\gamma=\frac{1}{1-\frac{d_\lambda}{n}}$.
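
To make the role of $\gamma$ concrete, here is a minimal numerical sketch in Python with NumPy. It rests on assumptions that are not stated in the text above: the effective dimension is taken to be the standard $d_\lambda=\mathrm{tr}(\Sigma(\Sigma+\lambda I)^{-1})$, the shrunk resolvent is taken to be $(\gamma\hat{\Sigma}+\lambda I)^{-1}$, each agent uses its own local sample size as $n$, and the data are Gaussian with a known $\Sigma$. The function names and parameter values are hypothetical, chosen only for illustration; this is not the authors' implementation.

```python
import numpy as np

def effective_dimension(sigma, lam):
    # Assumed standard definition: d_lambda = tr(Sigma (Sigma + lam I)^{-1}).
    d = sigma.shape[0]
    return np.trace(sigma @ np.linalg.inv(sigma + lam * np.eye(d)))

def averaged_resolvents(sigma, lam, n_local, m, rng):
    # Average m local resolvent estimates, with and without shrinkage.
    d = sigma.shape[0]
    # gamma = 1 / (1 - d_lambda / n), with n taken as the local sample size.
    gamma = 1.0 / (1.0 - effective_dimension(sigma, lam) / n_local)
    chol = np.linalg.cholesky(sigma)
    eye = np.eye(d)
    naive = np.zeros((d, d))
    shrunk = np.zeros((d, d))
    for _ in range(m):
        # Rows of `a` are i.i.d. N(0, sigma) samples held by one agent.
        a = rng.standard_normal((n_local, d)) @ chol.T
        sigma_hat = a.T @ a / n_local
        naive += np.linalg.inv(sigma_hat + lam * eye)
        # Assumed form of the shrinkage estimator: rescale sigma_hat by gamma.
        shrunk += np.linalg.inv(gamma * sigma_hat + lam * eye)
    return gamma, naive / m, shrunk / m

def relative_error(estimate, target):
    # Relative difference in spectral norm.
    return np.linalg.norm(estimate - target, 2) / np.linalg.norm(target, 2)

rng = np.random.default_rng(0)
d, n_local, m, lam = 10, 50, 500, 0.1  # illustrative values only
sigma = np.eye(d)
gamma, naive, shrunk = averaged_resolvents(sigma, lam, n_local, m, rng)
target = np.linalg.inv(sigma + lam * np.eye(d))
err_naive = relative_error(naive, target)
err_shrunk = relative_error(shrunk, target)
print(f"gamma = {gamma:.4f}")
print(f"naive averaging error:  {err_naive:.4f}")
print(f"shrunk averaging error: {err_shrunk:.4f}")
```

In this setup the error of naive averaging is dominated by inversion bias, which does not shrink as more agents are added, while the rescaled estimate is left mostly with sampling noise. In practice $\Sigma$ is unknown, so $d_\lambda$ would have to be estimated from data; the sketch sidesteps that by using the true $\Sigma$.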

Figures (15)

  • Figure 1: Synthetic data experiments on ridge regression. Total number of data points $n=30000$, data dimension $d=150$, number of agents $m=200$, regularizer $\lambda=0.01$. The left plot shows the convergence of distributed Newton's method (Algorithm \ref{alg:newton}). The right plot shows the convergence of distributed inexact Newton's method (Algorithm \ref{alg:pcg}). Step sizes are chosen via line search in all methods. See Section \ref{simu} for details.
  • Figure 2: Experiments with real data on covariance resolvent estimation. The dataset is split evenly across agents. We let $\lambda=0.001$ and $\Sigma=\frac{1}{n}A^TA$. The relative spectral-norm difference between the true covariance resolvent $R$ and the estimated covariance resolvent $\tilde{R}$ is plotted; see Section \ref{section5.1} for details.
  • Figure 3: Experiments with real data on distributed Newton's method applied to ridge regression. Line search is used in all methods to determine the step sizes. The total number of samples is rounded down to a multiple of the number of agents and split evenly across agents. We let $m=100,\lambda=0.1$ for segment, $m=20,\lambda=0.05$ for bodyfat, $m=20,\lambda=0.5$ for eunite2001, $m=100,\lambda=0.01$ for pendigits, where $\lambda$ is the regularization parameter and $m$ denotes the number of agents.
  • Figure 4: Experiments with real data on Iterative Hessian Sketch method applied to ridge regression. Line search is used to determine the step sizes. We let $\lambda=0.01$ for bodyfat, housing, mpg and $\lambda=0.001$ for triazines; $m=100$ for bodyfat, $m=50$ for housing, $m=30$ for mpg, $m=300$ for triazines where $\lambda$ denotes the regularization parameter and $m$ denotes the sketch size.
  • Figure 5: Synthetic data experiments on covariance resolvent estimation. Let $m$ denote the number of agents, $d$ denote data dimension, and $\lambda$ denote the regularizer. We take $m=100, d=10, \lambda=0.1$. Data is i.i.d. $\mathcal{N}(0,\Sigma)$ with $\Sigma=0.1I$ in the left plot, and $\Sigma=100C^TC, C_{ij}\sim U(0,1)$ in the right plot.
  • ...and 10 more figures

Theorems & Definitions (41)

  • Theorem
  • Theorem 2.2
  • proof
  • Theorem 2.3
  • proof
  • Theorem 2.4
  • proof
  • Theorem 2.5
  • proof
  • Theorem 3.1
  • ...and 31 more