Series of Hessian-Vector Products for Tractable Saddle-Free Newton Optimisation of Neural Networks

Elre T. Oldewage; Ross M. Clarke; José Miguel Hernández-Lobato

TL;DR

This work tackles the challenge of applying second-order optimisation to deep neural networks by proposing a scalable method in the style of Saddle-Free Newton (SFN) that avoids explicit Hessian storage and eigendecomposition. It obtains the absolute-value transformation of the Hessian eigenvalues as the principal square root of $H^2$ and approximates $(H^2)^{-\frac{1}{2}}$ with a convergent matrix series, using Hessian-vector products to apply the update efficiently. A key contribution is the combination of an adaptive scaling factor $V$ with sequence acceleration (Wynn's epsilon), which makes truncating the series practical while the untruncated series still converges to the exact SFN update. Empirical results on UCI Energy and on larger-scale image datasets (Fashion-MNIST, SVHN and CIFAR-10) with ResNet-18 show that the method is competitive with established optimisers and benefits particularly from adaptivity, although exact SFN and well-tuned KFAC/Adam variants can still outperform it in some settings. The approach opens avenues for further improvement via conditioning strategies and CG-inspired techniques to close the remaining gap to state-of-the-art first- and second-order optimisers.
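To illustrate the sequence-acceleration idea in isolation, the following is a minimal sketch of the scalar Wynn epsilon algorithm. It is not the paper's implementation: the function name `wynn_epsilon` and the test sequence (partial sums of the alternating harmonic series, which converge slowly to ln 2) are assumptions chosen for illustration, and this summary does not specify how the acceleration is applied to the vector-valued partial sums of the method itself.

```python
import math


def wynn_epsilon(partial_sums):
    """Estimate the limit of a scalar sequence with Wynn's epsilon algorithm.

    Recurrence: eps_{k+1}^{(n)} = eps_{k-1}^{(n+1)} + 1 / (eps_k^{(n+1)} - eps_k^{(n)}),
    with eps_{-1} = 0 and eps_0 = the partial sums. Only even columns
    approximate the limit; odd columns are intermediate quantities.
    """
    prev = [0.0] * (len(partial_sums) + 1)  # column eps_{-1}
    curr = list(partial_sums)               # column eps_0
    best = curr[-1]
    for column in range(1, len(partial_sums)):
        nxt = []
        for n in range(len(curr) - 1):
            diff = curr[n + 1] - curr[n]
            if diff == 0.0:
                # Neighbouring entries coincide, so stop with the latest estimate.
                return best
            nxt.append(prev[n + 1] + 1.0 / diff)
        prev, curr = curr, nxt
        if column % 2 == 0:
            best = curr[-1]
    return best


# Illustrative input: nine partial sums of 1 - 1/2 + 1/3 - ... (limit ln 2).
sums, running = [], 0.0
for k in range(1, 10):
    running += (-1) ** (k + 1) / k
    sums.append(running)

print("raw error:        ", abs(sums[-1] - math.log(2)))
print("accelerated error:", abs(wynn_epsilon(sums) - math.log(2)))
```

The point of the sketch is that a handful of partial sums, once accelerated, can give a far better limit estimate than the last partial sum alone, which is what makes an early truncation of a slowly converging series viable.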

Abstract

Despite their popularity in the field of continuous optimisation, second-order quasi-Newton methods are challenging to apply in machine learning, as the Hessian matrix is intractably large. This computational burden is exacerbated by the need to address non-convexity, for instance by modifying the Hessian's eigenvalues as in Saddle-Free Newton methods. We propose an optimisation algorithm which addresses both of these concerns - to our knowledge, the first efficiently-scalable optimisation algorithm to asymptotically use the exact inverse Hessian with absolute-value eigenvalues. Our method frames the problem as a series which principally square-roots and inverts the squared Hessian, then uses it to precondition a gradient vector, all without explicitly computing or eigendecomposing the Hessian. A truncation of this infinite series provides a new optimisation algorithm which is scalable and comparable to other first- and second-order optimisation methods in both runtime and optimisation performance. We demonstrate this in a variety of settings, including a ResNet-18 trained on CIFAR-10.
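As a rough illustration of the idea in the abstract, the sketch below approximates $(H^2)^{-1/2} g$ using only Hessian-vector products. The exact series used in the paper is not reproduced in this summary, so the code assumes the standard binomial expansion $(H^2)^{-1/2} = V^{-1/2} \sum_k \binom{2k}{k} 4^{-k} (I - H^2/V)^k$, which converges when every eigenvalue of $H^2$ lies in $(0, 2V)$. The names `hvp` and `sfn_series_update`, the fixed scaling `V`, and the explicit 2×2 matrix are hypothetical; in a neural network the product would come from automatic differentiation rather than a stored matrix.

```python
import math


def hvp(H, v):
    """Stand-in Hessian-vector product using an explicit matrix (toy setting only)."""
    return [sum(h_ij * v_j for h_ij, v_j in zip(row, v)) for row in H]


def sfn_series_update(hvp_fn, g, V, num_terms):
    """Approximate (H^2)^{-1/2} g with a truncated binomial series.

    Assumed series: V^{-1/2} * sum_k c_k (I - H^2/V)^k g, c_k = binom(2k, k) / 4^k.
    Each extra term costs two Hessian-vector products and no matrix storage.
    """
    term = list(g)    # holds (I - H^2/V)^k g, starting at k = 0
    total = list(g)   # running sum, c_0 = 1
    coeff = 1.0
    for k in range(1, num_terms):
        h2_term = hvp_fn(hvp_fn(term))                    # H^2 applied to term
        term = [t - x / V for t, x in zip(term, h2_term)]
        coeff *= (2 * k - 1) / (2 * k)                    # c_k from c_{k-1}
        total = [s + coeff * t for s, t in zip(total, term)]
    return [s / math.sqrt(V) for s in total]


# Indefinite toy Hessian with eigenvalues 2 and -1, so |H|^{-1} g = [0.75, -0.25].
H = [[0.5, 1.5], [1.5, 0.5]]
g = [1.0, 0.0]
step = sfn_series_update(lambda v: hvp(H, v), g, V=4.0, num_terms=60)
print(step)
```

Here `V = 4.0` equals the largest eigenvalue of $H^2$, which keeps the series inside its convergence region; a smaller admissible `V` would converge in fewer terms, which is one reason an adaptive choice of the scaling matters.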

Paper Structure (34 sections, 1 theorem, 53 equations, 10 figures, 11 tables, 2 algorithms)


Introduction
Related Work
Derivations
Preliminaries
Absolute Values as Square-Rooted Squares
Inverse Square Root Series
Hessian Products, Choice of $V$ and Series Acceleration
Experiments
UCI Energy
Larger Scale Experiments
Discussion
Conclusions
Empirical Notes
Datasets Used
Computing Resources Used
...and 19 more sections

Key Result

Theorem 1

Given the Lipschitz and invertible-Hessian assumptions, suppose that $\left\Vert \mathbf{x}_t - \mathbf{x}_C \right\Vert < \beta \left\Vert \mathbf{g}(\mathbf{x}_t) \right\Vert$, where $\mathbf{x}_C$ is a critical point, and let $\overline{\mathbf{H}}$ […] where $D = \frac{L C_\delta^2 + 4 L C_\delta \beta}{2}$.

Figures (10)

Figure 1: Motivation for Saddle-Free Newton methods. This locally quadratic surface has a saddle point, and its Hessian gives two principal directions of curvature. From any initial point, SGD gives an update that neglects curvature, and Newton's method converges immediately to the saddle point. Exact Saddle-Free Newton takes absolute values of the Hessian eigenvalues, negating the components of the Newton update in concave directions and thus changing the saddle point from an attractor to a repeller. Our series-based method is an approximate Saddle-Free Newton algorithm which converges to the exact Saddle-Free Newton result.
Figure 2: Median training (left) and test (right) MSEs achieved over wall-clock time (top) and training iterations (bottom) on UCI Energy by various optimisers in the full-batch setting, bootstrap-sampled from 50 random seeds. Optimal hyperparameters were tuned with ASHA. Note the logarithmic horizontal axes.
Figure 3: Median training (left) and test (right) MSEs plotted against the log of wall-clock time. The top row includes all additional optimisers; the bottom row excludes KFAC (Kazuki) for clarity. Results are on UCI Energy in the full-batch setting and are bootstrap-sampled from 50 random seeds. Optimal hyperparameters were tuned with ASHA.
Figure 4: Median training (left) and test (right) loss achieved on Fashion-MNIST (top), SVHN (centre) and CIFAR-10 (bottom) by various optimisers using the optimal hyperparameters chosen by ASHA. Values are bootstrap-sampled from 50 random seeds.
Figure 5: Ranking of optimisers according to lowest training (left) and test (right) losses achieved on Fashion-MNIST (top), SVHN (centre) and CIFAR-10 (bottom). Error bars show standard error in the mean. Values are the minimum of the loss profile across time, generated by bootstrap sampling from 50 random seeds.
...and 5 more figures
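
The behaviour described in the Figure 1 caption can be checked on a toy quadratic. The sketch below is an illustration under stated assumptions, not code from the paper: it uses an axis-aligned surface $f(x) = \frac{1}{2} \sum_i \lambda_i x_i^2$ so that the Hessian eigenvalues are simply the $\lambda_i$, and the function name `quadratic_saddle_steps`, the starting point and the learning rate are all hypothetical.

```python
def quadratic_saddle_steps(x, eigenvalues, learning_rate=0.1):
    """Take one step from x on f(x) = 0.5 * sum_i lambda_i * x_i^2.

    Returns the next point under a plain gradient step, a Newton step,
    and an exact Saddle-Free Newton step (absolute-value eigenvalues).
    """
    grad = [lam * xi for lam, xi in zip(eigenvalues, x)]
    gradient_step = [xi - learning_rate * gi for xi, gi in zip(x, grad)]
    newton = [xi - gi / lam for xi, gi, lam in zip(x, grad, eigenvalues)]
    saddle_free = [xi - gi / abs(lam) for xi, gi, lam in zip(x, grad, eigenvalues)]
    return gradient_step, newton, saddle_free


# One convex direction (eigenvalue 2) and one concave direction (eigenvalue -1).
gradient_step, newton, saddle_free = quadratic_saddle_steps(
    [1.0, 0.5], eigenvalues=[2.0, -1.0]
)
print("gradient step:", gradient_step)
print("Newton:       ", newton)        # lands on the saddle at the origin
print("saddle-free:  ", saddle_free)   # moves away along the concave axis
```

In the convex direction the Newton and Saddle-Free Newton steps agree; in the concave direction the sign of the step flips, so the saddle stops attracting the iterate.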

Theorems & Definitions (2)

Theorem 1
proof
