Optimal Shrinkage for Distributed Second-Order Optimization

Fangzhao Zhang; Mert Pilanci

TL;DR

This work introduces a novel, asymptotically unbiased shrinkage-based estimator for the resolvent of Gram matrices and characterizes its non-asymptotic convergence rate in the isotropic case.

Abstract

In this work, we address the problem of Hessian inversion bias in distributed second-order optimization algorithms. We introduce a novel shrinkage-based estimator for the resolvent of Gram matrices which is asymptotically unbiased, and characterize its non-asymptotic convergence rate in the isotropic case. We apply this estimator to bias correction of Newton steps in distributed second-order optimization algorithms, as well as in randomized sketching-based methods. We examine the bias present in the naive averaging-based distributed Newton's method using analytical expressions and contrast it with our proposed bias-free approach. Our approach leads to significant improvements in convergence rate compared to standard baselines and recent proposals, as shown through experiments on both real and synthetic datasets.
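The inversion bias mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the function name `newton_directions`, the ridge loss scaling, and the synthetic Gaussian data are all assumptions made for illustration. The sketch splits the rows of a data matrix evenly across `m` agents, inverts each local Hessian, and averages the inverses. Because matrix inversion is operator convex and the local Hessians average exactly to the global Hessian under an even split, the averaged inverse dominates the true inverse Hessian in the Loewner order, so the naive Newton direction is systematically too long.

```python
import numpy as np


def newton_directions(a, b, x, lam, m):
    """Exact vs. naively averaged Newton direction for ridge regression.

    Assumed loss: (1/2n)||Ax - b||^2 + (lam/2)||x||^2.
    Rows of `a` are split into m near-equal blocks, one per agent.
    """
    n, d = a.shape
    eye = np.eye(d)
    grad = a.T @ (a @ x - b) / n + lam * x           # full gradient
    h_inv = np.linalg.inv(a.T @ a / n + lam * eye)   # exact inverse Hessian
    avg_inv = np.zeros((d, d))
    for a_i in np.array_split(a, m):                 # one block per agent
        h_i = a_i.T @ a_i / a_i.shape[0] + lam * eye # local Hessian
        avg_inv += np.linalg.inv(h_i) / m            # naive averaging of inverses
    return h_inv @ grad, avg_inv @ grad, h_inv, avg_inv


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.standard_normal((2000, 20))              # hypothetical synthetic data
    b = rng.standard_normal(2000)
    exact, naive, h_inv, avg_inv = newton_directions(a, b, np.zeros(20), 0.01, 50)
    gap = avg_inv - h_inv
    # Smallest eigenvalue of the gap; nonnegative means avg_inv dominates h_inv.
    print(np.linalg.eigvalsh((gap + gap.T) / 2).min())
```

The even split matters here: with unequal block sizes the local Hessians no longer average to the global Hessian with uniform weights, and the Loewner-order guarantee would need size-proportional weights instead.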

Paper Structure (43 sections, 19 theorems, 88 equations, 15 figures, 2 tables, 2 algorithms)

Introduction
Prior Work
Contribution
Main Theorems
Asymptotically Unbiased Shrinkage Formula for the Resolvent of Covariance
Isotropic Convergence Rate
Small Regularizer Regime
Asymptotic Bias of the Naive Averaging Method
Application to Distributed Second-Order Optimization Algorithms
Convergence Analysis for Regularized Quadratic Loss
Convergence Analysis for Regularized General Convex Smooth Loss
Communication and Computation Complexity Analysis
Application to Randomized Second-Order Optimization Algorithms
Numerical Simulation
Estimation of the Effective Dimension
...and 28 more sections

Key Result

Theorem 1

(informal, see Theorem thm2 for the assumptions) For a random data matrix $A\in \mathbb{R}^{n\times d}$, let $d_\lambda$ denote the effective dimension of the true covariance matrix $\Sigma_n$, and $\hat{\Sigma}_n$ denote the empirical covariance matrix. Then, we have where $\gamma=\frac{1}{1-\frac{d_\lambda}{n}}$.
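To make the role of $\gamma$ concrete, the following is a minimal numerical sketch under stated assumptions. The excerpt does not define the effective dimension or display the estimator, so the code assumes the standard definition $d_\lambda=\mathrm{tr}\left(\Sigma(\Sigma+\lambda I)^{-1}\right)$ and assumes the shrinkage estimator rescales the empirical covariance by $\gamma$, i.e. $(\gamma\hat{\Sigma}_n+\lambda I)^{-1}$. Both are our reading rather than something confirmed by the text above. The function names, the Gaussian data model, and all parameter values are hypothetical.

```python
import numpy as np


def effective_dimension(sigma, lam):
    """Assumed definition: d_lambda = tr(Sigma (Sigma + lam I)^{-1})."""
    d = sigma.shape[0]
    return float(np.trace(sigma @ np.linalg.inv(sigma + lam * np.eye(d))))


def resolvent_errors(sigma, lam, n, m, seed=0):
    """Relative spectral-norm error of averaged resolvent estimates.

    Each of m agents draws n i.i.d. N(0, sigma) rows (an assumption) and
    forms its empirical covariance. Requires d_lambda < n so gamma > 0.
    Returns (gamma, naive_error, shrunk_error).
    """
    rng = np.random.default_rng(seed)
    d = sigma.shape[0]
    eye = np.eye(d)
    gamma = 1.0 / (1.0 - effective_dimension(sigma, lam) / n)
    chol = np.linalg.cholesky(sigma)
    naive = np.zeros((d, d))
    shrunk = np.zeros((d, d))
    for _ in range(m):
        a = rng.standard_normal((n, d)) @ chol.T     # rows have covariance sigma
        sigma_hat = a.T @ a / n                      # local empirical covariance
        naive += np.linalg.inv(sigma_hat + lam * eye) / m
        # Assumed shrinkage form: rescale sigma_hat by gamma before inverting.
        shrunk += np.linalg.inv(gamma * sigma_hat + lam * eye) / m
    target = np.linalg.inv(sigma + lam * eye)        # true covariance resolvent
    scale = np.linalg.norm(target, 2)
    return (gamma,
            np.linalg.norm(naive - target, 2) / scale,
            np.linalg.norm(shrunk - target, 2) / scale)


if __name__ == "__main__":
    gamma, err_naive, err_shrunk = resolvent_errors(np.eye(20), 0.1, 40, 200)
    print(gamma, err_naive, err_shrunk)
```

In this setup $d_\lambda/n$ is deliberately large (about 0.45), so the naive average of local resolvents is visibly biased, while the rescaled version should track the true resolvent much more closely.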

Figures (15)

Figure 1: Synthetic data experiments on ridge regression. Total number of data points $n=30000$, data dimension $d=150$, number of agents $m=200$, regularizer $\lambda=0.01$. The left plot shows the convergence of distributed Newton's method (Algorithm \ref{alg:newton}). The right plot shows the convergence of distributed inexact Newton's method (Algorithm \ref{alg:pcg}). Step sizes are chosen via line search in all methods. See Section \ref{simu} for details.
Figure 2: Experiments with real data on covariance resolvent estimation. The dataset is split evenly across agents. We let $\lambda=0.001$ and $\Sigma=\frac{1}{n}A^TA$. The plot shows the relative spectral-norm difference between the true covariance resolvent $R$ and the estimated covariance resolvent $\tilde{R}$; see Section \ref{section5.1} for details.
Figure 3: Experiments with real data on distributed Newton's method applied to ridge regression. Line search is used in all methods to determine the step sizes. The total number of samples is rounded down to a multiple of the number of agents and split evenly across agents. We let $m=100,\lambda=0.1$ for segment, $m=20,\lambda=0.05$ for bodyfat, $m=20,\lambda=0.5$ for eunite2001, $m=100,\lambda=0.01$ for pendigits, where $\lambda$ is the regularization parameter and $m$ denotes the number of agents.
Figure 4: Experiments with real data on the Iterative Hessian Sketch method applied to ridge regression. Line search is used to determine the step sizes. We let $\lambda=0.01$ for bodyfat, housing, mpg and $\lambda=0.001$ for triazines; $m=100$ for bodyfat, $m=50$ for housing, $m=30$ for mpg, $m=300$ for triazines, where $\lambda$ denotes the regularization parameter and $m$ denotes the sketch size.
Figure 5: Synthetic data experiments on covariance resolvent estimation. Let $m$ denote the number of agents, $d$ denote data dimension, and $\lambda$ denote the regularizer. We take $m=100, d=10, \lambda=0.1$. Data is i.i.d. $\mathcal{N}(0,\Sigma)$ with $\Sigma=0.1I$ in the left plot, and $\Sigma=100C^TC, C_{ij}\sim U(0,1)$ in the right plot.
...and 10 more figures

Theorems & Definitions (41)

Theorem
Theorem 2.2
proof
Theorem 2.3
proof
Theorem 2.4
proof
Theorem 2.5
proof
Theorem 3.1
...and 31 more
