Comparing BFGS and OGR for Second-Order Optimization
Adrian Przybysz, Mikołaj Kołek, Franciszek Sobota, Jarek Duda
TL;DR
The paper tackles the high cost of second-order optimization in high-dimensional neural networks by comparing traditional BFGS with Online Gradient Regression (OGR), an online curvature estimator that avoids Hessian inversion. OGR regresses gradients against parameter displacements to infer local curvature, optionally enforcing symmetry and operating in a reduced subspace. Because its curvature estimates may be negative, it can support saddle-aware updates. Across standard test functions and ablations, OGR converges faster and reaches lower final losses than BFGS, particularly in non-convex landscapes, and adding a line search improves results in many cases. The findings suggest OGR as a practical, near-first-order-cost alternative to classical second-order methods, with potential for scaling to neural network training and integration with momentum-based adaptivity.
Abstract
Estimating the Hessian matrix, especially for neural network training, is a challenging problem due to high dimensionality and cost. In this work, we compare the classical Sherman-Morrison update used in the popular BFGS method (Broyden-Fletcher-Goldfarb-Shanno), which maintains a positive definite Hessian approximation under a convexity assumption, with a novel approach called Online Gradient Regression (OGR). OGR performs regression of gradients against positions using an exponential moving average to estimate second derivatives online, without requiring Hessian inversion. Unlike BFGS, OGR allows estimation of a general (not necessarily positive definite) Hessian and can thus handle non-convex structures. We evaluate both methods across standard test functions and demonstrate that OGR achieves faster convergence and improved loss, particularly in non-convex settings.
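The idea of regressing gradients against positions can be illustrated in one dimension. The following is a minimal sketch, not the authors' implementation: it assumes the curvature is estimated as the exponentially weighted covariance of position and gradient divided by the exponentially weighted variance of position. The names `OGR1D`, `estimate_curvature`, and `saddle_aware_step`, the decay `beta`, and the absolute-value step rule are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class OGR1D:
    """Illustrative 1D curvature estimator (assumed form, not the paper's code)."""
    beta: float = 0.9   # assumed decay of the exponential moving average
    w: float = 0.0      # running sum of weights
    s_t: float = 0.0    # weighted sum of theta
    s_g: float = 0.0    # weighted sum of gradient
    s_tt: float = 0.0   # weighted sum of theta * theta
    s_tg: float = 0.0   # weighted sum of theta * gradient

    def update(self, theta: float, grad: float) -> None:
        # Decay old statistics, then add the newest (theta, grad) pair.
        b = self.beta
        self.w = b * self.w + 1.0
        self.s_t = b * self.s_t + theta
        self.s_g = b * self.s_g + grad
        self.s_tt = b * self.s_tt + theta * theta
        self.s_tg = b * self.s_tg + theta * grad

    def curvature(self) -> Optional[float]:
        # Slope of the weighted least-squares line of grad against theta.
        if self.w == 0.0:
            return None
        mean_t = self.s_t / self.w
        mean_g = self.s_g / self.w
        var_t = self.s_tt / self.w - mean_t * mean_t
        cov_tg = self.s_tg / self.w - mean_t * mean_g
        if var_t <= 1e-12:  # positions too similar to fit a slope
            return None
        return cov_tg / var_t


def estimate_curvature(thetas: Iterable[float],
                       grad_fn: Callable[[float], float],
                       beta: float = 0.9) -> Optional[float]:
    """Feed visited positions and their gradients; return the curvature estimate."""
    est = OGR1D(beta=beta)
    for theta in thetas:
        est.update(theta, grad_fn(theta))
    return est.curvature()


def saddle_aware_step(grad: float, curvature: float, eps: float = 1e-8) -> float:
    # Assumed rule: divide by |curvature| so the step always moves downhill,
    # even when the estimated curvature is negative.
    return -grad / max(abs(curvature), eps)
```

On a quadratic whose gradient is `3 * theta`, this regression recovers a curvature of 3 up to rounding, and for gradient `-2 * theta` it returns a negative estimate of -2, which a positive definite BFGS approximation would not represent. No matrix is inverted at any point; in higher dimensions the same statistics would become vectors and matrices, which this sketch does not cover.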
