FOSI: Hybrid First and Second Order Optimization

Hadar Sivan, Moshe Gabel, Assaf Schuster

TL;DR

This work presents FOSI, a novel meta-algorithm that improves the performance of any base first-order optimizer by efficiently incorporating second-order information during the optimization process, and outperforms second-order methods (K-FAC and L-BFGS).

Abstract

Popular machine learning approaches forgo second-order information due to the difficulty of computing curvature in high dimensions. We present FOSI, a novel meta-algorithm that improves the performance of any base first-order optimizer by efficiently incorporating second-order information during the optimization process. In each iteration, FOSI implicitly splits the function into two quadratic functions defined on orthogonal subspaces, then uses a second-order method to minimize the first, and the base optimizer to minimize the other. We formally analyze FOSI's convergence and the conditions under which it improves a base optimizer. Our empirical evaluation demonstrates that FOSI improves the convergence rate and optimization time of first-order methods such as Heavy-Ball and Adam, and outperforms second-order methods (K-FAC and L-BFGS).
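
The split described in the abstract can be made concrete for the purely quadratic case $f(\theta) = \tfrac{1}{2}\theta^T H \theta$ with symmetric $H$. This is only a sketch under assumed notation that does not appear in the abstract: $V_1$ is an $n \times k$ matrix whose orthonormal columns are eigenvectors of $H$, $\Lambda_1$ is the diagonal matrix of their (nonzero) eigenvalues, $P_1 = V_1 V_1^T$, and $P_2 = I - P_1$.

$$
\begin{aligned}
f(\theta) &= \tfrac{1}{2}\,(P_1\theta + P_2\theta)^T H\,(P_1\theta + P_2\theta) \\
          &= \underbrace{\tfrac{1}{2}\,\theta^T P_1 H P_1\,\theta}_{f_1(\theta)}
           + \underbrace{\tfrac{1}{2}\,\theta^T P_2 H P_2\,\theta}_{f_2(\theta)}
           + \theta^T P_1 H P_2\,\theta, \\
P_1 H P_2 &= H P_1 P_2 = 0
  \quad \text{because } H V_1 = V_1 \Lambda_1 \text{ gives } P_1 H = H P_1 = V_1 \Lambda_1 V_1^T
  \text{ and } P_1 P_2 = 0, \\
\nabla f_1(\theta) &= P_1 \nabla f(\theta), \qquad
\nabla f_2(\theta) = P_2 \nabla f(\theta), \\
d_1 &= -V_1 \Lambda_1^{-1} V_1^T\, \nabla f(\theta)
  \quad \text{(Newton step for } f_1 \text{ within the span of } V_1\text{)}.
\end{aligned}
$$

Under these assumptions the cross term vanishes, so $f = f_1 + f_2$, and a first-order optimizer only needs the projected gradient $P_2 \nabla f(\theta)$ to work on $f_2$.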

Paper Structure

This paper contains 30 sections, 5 theorems, 17 equations, 9 figures, 2 tables, and 2 algorithms.

Key Result

Lemma 1

Let $f(\theta)$ be a convex, twice differentiable function and let BaseOpt be a first-order optimizer that utilizes a positive definite diagonal preconditioner. Let $H$ be $f$'s Hessian at iteration $t$ of FOSI with BaseOpt, and let $V \operatorname{diag}(\boldsymbol{\lambda}) V^T$ be an eigendecomposi

Figures (9)

  • Figure 1: FOSI's update steps (arrows) when minimizing a quadratic function $f(\theta)$. FOSI implicitly separates the space into two orthogonal complement subspaces and then splits the original function $f$ into two functions $f_1$ and $f_2$ over these subspaces, such that $f = f_1 + f_2$. FOSI solves $\min f$ by simultaneously solving $\min f_1$ with Newton's method and $\min f_2$ with the base optimizer. The update step is the sum of $d_1$ and $d_2$, the updates to $f_1$ and $f_2$ respectively.
  • Figure 2: Wall time in seconds to reach target validation accuracy (AC, TL, LR) or loss (LM, AE). The target (in parentheses) is the best one reached by the base optimizer. No single base optimizer is best for all tasks.
  • Figure 3: Learning curves for minimizing PD quadratic functions $f_H(\theta) = 0.5 \theta^T H \theta$ with varying $n$ and $\lambda_1$ values. FOSI converges more than two orders of magnitude faster than its counterparts.
  • Figure 4: Learning curves of GD and FOSI for the minimization of the quadratic function $f(\theta) = 0.5 \theta^T H \theta$, with $\theta \in \mathbb{R}^{100}$. $H$'s eigenvectors are a random orthogonal basis, $\eta=0.001, \lambda_1=10, \lambda_{10} = 9, \lambda_n = 0.01$, and FOSI runs with $k=9, \ell = 0, \alpha=1$. While FOSI's effective condition number is larger than the original one, it converges much faster than the base optimizer.
  • Figure 5: Learning curves of different optimizers for different DNN training tasks. In most cases FOSI obtains faster convergence than the base optimizers across epochs (left) and across wall time (middle). Since FOSI accelerates convergence, it also leads to an earlier overfitting point when the base optimizer has a tendency to overfit, as can be observed in the LR validation loss (right).
  • ...and 4 more figures
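
The update described in the Figure 1 caption (a Newton step $d_1$ on $f_1$ plus a base-optimizer step $d_2$ on $f_2$) can be illustrated with a small sketch. It is not the authors' implementation. It assumes a hand-built 2-D quadratic whose eigenpairs are known exactly, plain gradient descent as the base optimizer, and hypothetical names (`hybrid_step`, `gd_step`, `ETA`, `ALPHA`) and parameter values chosen only for illustration.

```python
import math

# Hypothetical 2-D problem: f(theta) = 0.5 * theta^T H theta, where
# H = LAM1 * v1 v1^T + LAM2 * v2 v2^T and the eigenpairs are chosen by hand.
PHI = math.pi / 6
V1 = (math.cos(PHI), math.sin(PHI))    # eigenvector of the large eigenvalue
V2 = (-math.sin(PHI), math.cos(PHI))   # eigenvector of the small eigenvalue
LAM1, LAM2 = 10.0, 0.1
ETA = 0.01    # learning rate of the base optimizer (gradient descent)
ALPHA = 1.0   # scaling of the Newton step


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]


def axpy(a, x, y):
    """Return a * x + y for 2-vectors."""
    return (a * x[0] + y[0], a * x[1] + y[1])


def grad(theta):
    # Gradient H theta, evaluated through the eigendecomposition.
    c1, c2 = dot(V1, theta), dot(V2, theta)
    return (LAM1 * c1 * V1[0] + LAM2 * c2 * V2[0],
            LAM1 * c1 * V1[1] + LAM2 * c2 * V2[1])


def loss(theta):
    return 0.5 * (LAM1 * dot(V1, theta) ** 2 + LAM2 * dot(V2, theta) ** 2)


def gd_step(theta):
    # Plain gradient descent, used as the baseline.
    return axpy(-ETA, grad(theta), theta)


def hybrid_step(theta):
    g = grad(theta)
    g1 = dot(V1, g)                          # gradient coordinate along v1
    d1 = axpy(-ALPHA * g1 / LAM1, V1, (0.0, 0.0))  # Newton step on f1
    g2 = axpy(-g1, V1, g)                    # gradient with its v1 part removed
    d_base = (-ETA * g2[0], -ETA * g2[1])    # base optimizer (GD) step on f2
    d2 = axpy(-dot(V1, d_base), V1, d_base)  # keep d2 orthogonal to v1
    return (theta[0] + d1[0] + d2[0], theta[1] + d1[1] + d2[1])


def run(step, theta, n_steps):
    for _ in range(n_steps):
        theta = step(theta)
    return theta


theta0 = (1.0, 1.0)
theta_gd = run(gd_step, theta0, 20)
theta_hybrid = run(hybrid_step, theta0, 20)
print("GD loss after 20 steps:    ", loss(theta_gd))
print("hybrid loss after 20 steps:", loss(theta_hybrid))
```

In this toy setting the Newton step removes the component along `V1` in a single iteration, while both methods make the same slow progress along `V2`, so the hybrid run ends with the lower loss. A practical implementation would have to estimate the eigenpairs rather than read them off, which this sketch does not attempt.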

Theorems & Definitions (9)

  • Lemma 1
  • Lemma 2
  • proof
  • Lemma 3
  • proof
  • Lemma 4
  • proof
  • Theorem 5: Theorem 2.8, doi:10.1137/15M1053141
  • proof