FOSI: Hybrid First and Second Order Optimization

Hadar Sivan; Moshe Gabel; Assaf Schuster

TL;DR

This work presents FOSI, a novel meta-algorithm that improves the performance of any base first-order optimizer by efficiently incorporating second-order information during the optimization process, and outperforms second-order methods (K-FAC and L-BFGS).

Abstract

Popular machine learning approaches forgo second-order information due to the difficulty of computing curvature in high dimensions. We present FOSI, a novel meta-algorithm that improves the performance of any base first-order optimizer by efficiently incorporating second-order information during the optimization process. In each iteration, FOSI implicitly splits the function into two quadratic functions defined on orthogonal subspaces, then uses a second-order method to minimize the first, and the base optimizer to minimize the other. We formally analyze FOSI's convergence and the conditions under which it improves a base optimizer. Our empirical evaluation demonstrates that FOSI improves the convergence rate and optimization time of first-order methods such as Heavy-Ball and Adam, and outperforms second-order methods (K-FAC and L-BFGS).
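The split described in the abstract can be illustrated on a small quadratic. The following is a minimal sketch, not the authors' implementation: it assumes the $k$ leading eigenpairs of the Hessian are known exactly (the paper estimates them), uses plain gradient descent as the base optimizer, and all names (`householder_basis`, `hybrid_step`, `alpha`, `eta`) are hypothetical. It takes a scaled Newton step on the gradient component inside the span of the leading eigenvectors and a gradient-descent step on the component in the orthogonal complement.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def householder_basis(u):
    """Columns of Q = I - 2 u u^T / (u.u); they form an orthonormal basis."""
    n, uu = len(u), dot(u, u)
    return [[(1.0 if r == j else 0.0) - 2.0 * u[r] * u[j] / uu for r in range(n)]
            for j in range(n)]

def grad(theta, basis, lams):
    """Gradient H @ theta of f, where H = sum_i lam_i q_i q_i^T."""
    g = [0.0] * len(theta)
    for q, lam in zip(basis, lams):
        c = lam * dot(q, theta)
        g = [gi + c * qi for gi, qi in zip(g, q)]
    return g

def f(theta, basis, lams):
    """f(theta) = 0.5 * theta^T H theta, evaluated in the eigenbasis."""
    return 0.5 * sum(lam * dot(q, theta) ** 2 for q, lam in zip(basis, lams))

def hybrid_step(theta, basis, lams, k, eta, alpha=1.0):
    """One hybrid step (illustrative assumption, with plain GD as the base).

    d1: scaled Newton update on the span of the k leading eigenvectors.
    d2: gradient-descent update on the orthogonal complement.
    With k=0 this reduces to ordinary gradient descent.
    """
    g = grad(theta, basis, lams)
    d1 = [0.0] * len(theta)
    g2 = list(g)  # becomes the gradient component in the complement
    for q, lam in zip(basis[:k], lams[:k]):
        c = dot(q, g)
        d1 = [di - alpha * c / lam * qi for di, qi in zip(d1, q)]
        g2 = [gi - c * qi for gi, qi in zip(g2, q)]
    d2 = [-eta * gi for gi in g2]
    return [t + a + b for t, a, b in zip(theta, d1, d2)]

# Toy problem: eigenvalues sorted in decreasing order, arbitrary orthonormal basis.
lams = [10.0, 9.0, 1.0, 0.5, 0.1, 0.01]
basis = householder_basis([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
theta0 = [1.0] * 6

theta_hybrid, theta_gd = theta0, theta0
for _ in range(5):
    theta_hybrid = hybrid_step(theta_hybrid, basis, lams, k=2, eta=0.05)
    theta_gd = hybrid_step(theta_gd, basis, lams, k=0, eta=0.05)

print("hybrid loss:", f(theta_hybrid, basis, lams))
print("GD loss:    ", f(theta_gd, basis, lams))
```

With `alpha=1`, a single hybrid step removes the components of `theta` along the two leading eigenvectors, while both runs treat the remaining directions identically. This toy setting only isolates the effect of the subspace split; it says nothing about the stochastic or deep-learning settings evaluated in the paper.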

Paper Structure

This paper contains 30 sections, 5 theorems, 17 equations, 9 figures, 2 tables, 2 algorithms.

Introduction
Background and Notation
First and Second-Order Integration
Extreme Spectrum Estimation (ESE)
The FOSI Optimizer
Preconditioner Analysis
Momentum
Convergence in the Stochastic Setting
Automatic Learning Rate Scaling
Error and Overhead
Evaluation
Deep Neural Networks
Comparison to Second-Order Methods
Quadratic Functions
Related Work
...and 15 more sections

Key Result

Lemma 1

Let $f(\theta)$ be a convex twice differentiable function and let BaseOpt be a first-order optimizer that utilizes a positive definite diagonal preconditioner. Let $H$ be $f$'s Hessian at iteration $t$ of FOSI with BaseOpt, and let $V \operatorname{diag}(\boldsymbol{\lambda}) V^T$ be an eigendecomposi

Figures (9)

Figure 1: FOSI's update steps (arrows) when minimizing a quadratic function $f(\theta)$. FOSI implicitly separates the space into two orthogonal complement subspaces and then splits the original function $f$ into two functions $f_1$ and $f_2$ over these subspaces, such that $f = f_1 + f_2$. FOSI solves $\min f$ by simultaneously solving $\min f_1$ with Newton's method and $\min f_2$ with the base optimizer. The update step is the sum of $d_1$ and $d_2$, the updates to $f_1$ and $f_2$ respectively.
Figure 2: Wall time in seconds to reach target validation accuracy (AC, TL, LR) or loss (LM, AE). The target (in parentheses) is the best one reached by the base optimizer. No single base optimizer is best for all tasks.
Figure 3: Learning curves for minimizing PD quadratic functions $f_H(\theta) = 0.5 \theta^T H \theta$ with varying $n$ and $\lambda_1$ values. FOSI converges more than two orders of magnitude faster than its counterparts.
Figure 4: Learning curves of GD and FOSI for the minimization of the quadratic function $f(\theta) = 0.5 \theta^T H \theta$, with $\theta \in \mathbb{R}^{100}$. $H$'s eigenvectors are a random orthogonal basis, $\eta=0.001, \lambda_1=10, \lambda_{10} = 9, \lambda_n = 0.01$, and FOSI runs with $k=9, \ell = 0, \alpha=1$. While FOSI's effective condition number is larger than the original one, it converges much faster than the base optimizer.
Figure 5: Learning curves of different optimizers for different DNN training tasks. In most cases FOSI obtains faster convergence than the base optimizers across epochs (left) and across wall time (middle). Since FOSI accelerates convergence, it also leads to an earlier overfitting point when the base optimizer has a tendency to overfit, as can be observed in the LR validation loss (right).
...and 4 more figures
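
A rough worked check of the Figure 4 caption, under an assumption that is not stated in the captions above: learning rates are absorbed into the preconditioner, so that FOSI with a GD base optimizer scales the $k$ leading eigen-directions by $\alpha/\lambda_i$ and the remaining directions by $\eta$. With the caption's values ($\eta = 0.001$, $\lambda_1 = 10$, $\lambda_{10} = 9$, $\lambda_n = 0.01$, $k = 9$, $\alpha = 1$) this gives:

$$
\kappa_{\text{orig}} = \frac{\lambda_1}{\lambda_n} = \frac{10}{0.01} = 10^{3},
\qquad
\text{effective eigenvalues: } \alpha = 1 \ (i \le 9), \quad \eta \lambda_i \in [10^{-5},\, 9 \times 10^{-3}] \ (i \ge 10),
\qquad
\kappa_{\text{eff}} = \frac{\alpha}{\eta \lambda_n} = \frac{1}{10^{-5}} = 10^{5}.
$$

Under this reading the effective condition number grows from $10^3$ to $10^5$, which matches the caption's remark. The nine leading directions are still handled in a single step rather than contracting by a factor of roughly $1 - \eta \lambda_i \approx 0.99$ per GD step.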

Theorems & Definitions (9)

Lemma 1
Lemma 2
proof
Lemma 3
proof
Lemma 4
proof
Theorem 5: Theorem 2.8, doi:10.1137/15M1053141
proof
