Second-Order Forward-Mode Automatic Differentiation for Optimization

Adam D. Cobb; Atılım Güneş Baydin; Barak A. Pearlmutter; Susmit Jha

TL;DR

This paper introduces a second-order hyperplane search, a novel optimization step that generalizes a second-order line search from a line to a $k$-dimensional hyperplane. The resulting second-order optimization algorithm consists of forward passes only, completely avoiding the storage overhead of backpropagation.

Abstract

This paper introduces a second-order hyperplane search, a novel optimization step that generalizes a second-order line search from a line to a $k$-dimensional hyperplane. This, combined with the forward-mode stochastic gradient method, yields a second-order optimization algorithm that consists of forward passes only, completely avoiding the storage overhead of backpropagation. Unlike recent work that relies on directional derivatives (or Jacobian--Vector Products, JVPs), we use hyper-dual numbers to jointly evaluate both directional derivatives and their second-order quadratic terms. As a result, we introduce forward-mode weight perturbation with Hessian information (FoMoH). We then use FoMoH to develop a novel generalization of line search by extending it to a hyperplane search. We illustrate the utility of this extension and how it might be used to overcome some of the recent challenges of optimizing machine learning models without backpropagation. Our code is open-sourced at https://github.com/SRI-CSL/fomoh.
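To make the abstract's central idea concrete, the following is a minimal pure-Python sketch of hyper-dual arithmetic and of a hyperplane step built on top of it. It is not the API of the fomoh repository: the names `HyperDual`, `lift`, `forward_pass`, `solve`, and `hyperplane_step` are hypothetical and chosen only for illustration. The sketch assumes the hyperplane step minimizes the second-order Taylor expansion of $f$ over the span of $K$ directions $\mathbf{v}_1, \dots, \mathbf{v}_K$, which amounts to solving a $K \times K$ linear system whose entries $\mathbf{v}_i^{\top} \nabla^2 f(\bm{\theta}) \mathbf{v}_j$ and $\mathbf{v}_i^{\top} \nabla f(\bm{\theta})$ each come from forward passes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HyperDual:
    """a + b*eps1 + c*eps2 + d*eps1*eps2, with eps1**2 == eps2**2 == 0."""
    re: float         # function value
    e1: float = 0.0   # coefficient of eps1 (first directional derivative)
    e2: float = 0.0   # coefficient of eps2 (second directional derivative)
    e12: float = 0.0  # coefficient of eps1*eps2 (second-order term)

    def __add__(self, other):
        o = lift(other)
        return HyperDual(self.re + o.re, self.e1 + o.e1,
                         self.e2 + o.e2, self.e12 + o.e12)

    def __neg__(self):
        return HyperDual(-self.re, -self.e1, -self.e2, -self.e12)

    def __sub__(self, other):
        return self + (-lift(other))

    def __rsub__(self, other):
        return lift(other) + (-self)

    def __mul__(self, other):
        o = lift(other)
        return HyperDual(
            self.re * o.re,
            self.re * o.e1 + self.e1 * o.re,
            self.re * o.e2 + self.e2 * o.re,
            # The cross term is what carries the curvature information.
            self.re * o.e12 + self.e1 * o.e2 + self.e2 * o.e1 + self.e12 * o.re,
        )

    __radd__ = __add__
    __rmul__ = __mul__


def lift(x):
    """Promote a plain number to a hyper-dual constant."""
    return x if isinstance(x, HyperDual) else HyperDual(float(x))


def forward_pass(f, theta, v1, v2):
    """One forward pass returning (f, v1.grad f, v2.grad f, v1^T H v2)."""
    out = f([HyperDual(t, a, b) for t, a, b in zip(theta, v1, v2)])
    return out.re, out.e1, out.e2, out.e12


def solve(A, b):
    """Gauss-Jordan elimination with partial pivoting for a small system."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                factor = M[r][c] / M[c][c]
                M[r] = [x - factor * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


def hyperplane_step(f, theta, directions):
    """Minimize the local quadratic model of f over span(directions)."""
    K = len(directions)
    F = [0.0] * K                       # F[i]    = v_i . grad f
    H = [[0.0] * K for _ in range(K)]   # H[i][j] = v_i^T (Hessian) v_j
    for i in range(K):
        for j in range(i, K):
            _, gi, _, hij = forward_pass(f, theta, directions[i], directions[j])
            F[i] = gi
            H[i][j] = H[j][i] = hij
    kappa = solve(H, [-g for g in F])
    return [t + sum(k * v[d] for k, v in zip(kappa, directions))
            for d, t in enumerate(theta)]


def rosenbrock(p):
    """2D Rosenbrock function written with +, - and * only."""
    x, y = p
    return (1 - x) * (1 - x) + 100 * (y - x * x) * (y - x * x)


value, d1, d2, curvature = forward_pass(
    rosenbrock, [-1.0, 1.0], [1.0, 0.0], [0.0, 1.0])
new_theta = hyperplane_step(rosenbrock, [-1.0, 1.0], [[1.0, 0.0], [0.0, 1.0]])
```

With $K = 1$ the linear system collapses to a scalar and the step is a second-order line search along one direction. When the directions span the whole parameter space, as in the usage lines above, the step coincides with a Newton step. The sketch uses fixed basis directions and only addition, subtraction, and multiplication; a practical implementation would sample the directions randomly and support the full set of tensor operations.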

Paper Structure

This paper contains 23 sections, 7 equations, 7 figures, 4 tables, and 2 algorithms.

Introduction
Related Work
Automatic Differentiation
Higher-Order Forward Mode Automatic Differentiation
Implications of Hyper-Dual Numbers for Machine Learning
Local Curvature: $\mathbf{v}_1^{\top} \nabla^2 f(\bm{\theta}) \mathbf{v}_2$.
Computational Cost.
Forward-Mode Optimization with Second Order Information
Forward-Mode Line Search: FoMoH
Forward-Mode Line Search with Backpropagation: FoMoH-BP
Forward-Mode Hyperplane Search: FoMoH-$K$D
Experiments
Rosenbrock Function
10D Rosenbrock Function.
Logistic Regression
...and 8 more sections

Figures (7)

Figure 1: Results over the 2D Rosenbrock function.
Figure 2: Performance of FoMoH-$K$D for $K=2\dots10$ on the 10D Rosenbrock function. Solid lines represent the median, with transparent lines corresponding to each of the 10 random seeds. There is a clear pattern of higher dimensions performing better, with the performance of $K=10$ coinciding with Newton's Method (black dotted line).
Figure 3: Forward-mode training and validation curves for the logistic regression model on the MNIST dataset. The average and standard deviation are shown for five random initializations.
Figure 4: Forward-mode training and validation curves for the CNN on the MNIST dataset. The average and standard deviation are shown for three random initializations. Note how FGD (blue) is much slower to converge, with FoMoH-$K$D improving in performance with increasing $K$.
Figure 5: Histogram of the expected step taken by the stochastic approaches FoMoH, FGD, and FoMoH-2D, corresponding to Figure \ref{fig:single_step}. Notably, the variance of the 2D hyperplane search step is significantly smaller, and its expectation is close to the Newton step.
...and 2 more figures
