Series of Hessian-Vector Products for Tractable Saddle-Free Newton Optimisation of Neural Networks
Elre T. Oldewage, Ross M. Clarke, José Miguel Hernández-Lobato
TL;DR
This work tackles the challenge of applying second-order optimisation to deep neural networks by proposing a scalable SFN-like method that avoids explicit Hessian storage or eigendecomposition. It constructs the absolute-value transformation of the Hessian eigenvalues via the principal square root of $H^2$ and approximates $(H^2)^{-\frac{1}{2}}$ with a convergent matrix series, using Hessian-vector products to apply the update efficiently. A key contribution is the combination of an adaptive scaling factor $V$ with sequence acceleration (Wynn’s epsilon) to enable practical truncation while preserving convergence properties toward saddle-free minimisers. Empirical results across UCI Energy and large-scale datasets with ResNet-18 demonstrate that the method is competitive with established optimisers and particularly benefits from adaptivity, although exact SFN and well-tuned KFAC/Adam variants can still outperform in some settings. The approach opens avenues for further improvements via conditioning strategies and CG-inspired techniques to close the remaining gap to state-of-the-art first- and second-order optimisers.
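To see why the principal square root of $H^2$ yields the absolute-value transformation, it may help to write the step out. The symbols $Q$, $\Lambda$ and $\lambda_i$ below are notation assumed here for illustration (a symmetric Hessian with no zero eigenvalues), not taken from the summary above:

$$
H = Q \Lambda Q^\top
\;\Longrightarrow\;
H^2 = Q \Lambda^2 Q^\top
\;\Longrightarrow\;
(H^2)^{\frac{1}{2}} = Q \,\mathrm{diag}(|\lambda_i|)\, Q^\top = |H|,
\qquad
(H^2)^{-\frac{1}{2}} = Q \,\mathrm{diag}\!\left(\tfrac{1}{|\lambda_i|}\right) Q^\top = |H|^{-1}.
$$

Every quantity on the right depends on $H$ only through $H^2$, which is what allows the update to be applied using products of $H$ with vectors rather than an eigendecomposition.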
Abstract
Despite their popularity in the field of continuous optimisation, second-order quasi-Newton methods are challenging to apply in machine learning, as the Hessian matrix is intractably large. This computational burden is exacerbated by the need to address non-convexity, for instance by modifying the Hessian's eigenvalues as in Saddle-Free Newton methods. We propose an optimisation algorithm which addresses both of these concerns and is, to our knowledge, the first efficiently scalable optimisation algorithm to asymptotically use the exact inverse Hessian with absolute-value eigenvalues. Our method frames the problem as a series which takes the principal square root of the squared Hessian and inverts it, then uses the result to precondition a gradient vector, all without explicitly computing or eigendecomposing the Hessian. A truncation of this infinite series provides a new optimisation algorithm which is scalable and comparable to other first- and second-order optimisation methods in both runtime and optimisation performance. We demonstrate this in a variety of settings, including a ResNet-18 trained on CIFAR-10.
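The idea of a series that inverts the square root of the squared Hessian can be sketched on a toy problem. The snippet below is a minimal illustration, not the authors' implementation: it assumes the binomial expansion $(H^2)^{-\frac{1}{2}} = \frac{1}{\sqrt{V}} \sum_{k \ge 0} \binom{2k}{k} 4^{-k} \left(I - \frac{1}{V} H^2\right)^k$, which converges when $V$ exceeds the largest squared eigenvalue of $H$, and whether this matches the exact form used in the paper is an assumption. The matrix `H`, the helper `hvp` and the function `abs_inv_hessian_apply` are hypothetical names; in a real network `hvp` would be supplied by automatic differentiation rather than an explicit matrix.

```python
import math

# Hypothetical 2x2 indefinite Hessian (eigenvalues 3 and -1), for illustration only.
H = [[1.0, 2.0],
     [2.0, 1.0]]


def hvp(v):
    """Hessian-vector product H @ v; a stand-in for an autodiff-based product."""
    return [sum(h * x for h, x in zip(row, v)) for row in H]


def abs_inv_hessian_apply(g, V, num_terms):
    """Approximate (H^2)^{-1/2} g with a truncated binomial series.

    Only Hessian-vector products are used; H is never inverted or eigendecomposed.
    Assumes V is larger than the largest squared eigenvalue of H.
    """
    term = list(g)            # holds (I - H^2 / V)^k g
    total = [0.0] * len(g)    # running partial sum of the series
    coeff = 1.0               # holds binom(2k, k) / 4^k
    for k in range(num_terms):
        total = [t + coeff * x for t, x in zip(total, term)]
        h2_term = hvp(hvp(term))  # two products give H^2 applied to the term
        term = [x - y / V for x, y in zip(term, h2_term)]
        coeff *= (2 * k + 1) / (2 * k + 2)
    return [t / math.sqrt(V) for t in total]


# Precondition a toy gradient; the exact answer |H|^{-1} g is [2/3, -1/3].
step = abs_inv_hessian_apply([1.0, 0.0], V=10.0, num_terms=200)
```

Here `V=10.0` sits just above the largest squared eigenvalue (9), so the slowest component of the series shrinks by only a factor of 0.9 per term. That slow tail on a two-dimensional example gives a feel for why truncating the series in practice calls for a careful choice of scaling and some form of acceleration.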
