Comparing BFGS and OGR for Second-Order Optimization

Adrian Przybysz, Mikołaj Kołek, Franciszek Sobota, Jarek Duda

TL;DR

The paper tackles the high cost of second-order optimization in high-dimensional neural networks by comparing traditional BFGS with Online Gradient Regression (OGR), an online curvature estimator that avoids Hessian inversion. OGR regresses gradients against parameter displacements to infer local curvature, optionally enforcing symmetry and operating in a reduced subspace; it can produce negative curvature estimates, enabling saddle-aware updates. Across standard test functions and ablations, OGR demonstrates faster convergence and lower final losses than BFGS, particularly in non-convex landscapes, with line search offering gains in many cases. The findings suggest OGR as a practical alternative to classical second-order methods at near-first-order cost, with potential for scaling to neural network training and integration with momentum-based adaptivity.

Abstract

Estimating the Hessian matrix, especially for neural network training, is a challenging problem due to high dimensionality and cost. In this work, we compare the classical Sherman-Morrison update used in the popular BFGS method (Broyden-Fletcher-Goldfarb-Shanno), which maintains a positive definite Hessian approximation under a convexity assumption, with a novel approach called Online Gradient Regression (OGR). OGR performs regression of gradients against positions using an exponential moving average to estimate second derivatives online, without requiring Hessian inversion. Unlike BFGS, OGR allows estimation of a general (not necessarily positive definite) Hessian and can thus handle non-convex structures. We evaluate both methods across standard test functions and demonstrate that OGR achieves faster convergence and improved loss, particularly in non-convex settings.
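The regression idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the class name `OnlineGradientRegression`, the decay value `beta=0.9`, the use of a pseudo-inverse of the position covariance, and the random probe points standing in for an optimizer trajectory are all assumptions made for illustration. The sketch keeps exponentially weighted sums of positions, gradients, and their products, then reads off a Hessian estimate as the slope of a weighted linear fit of gradients against positions.

```python
import numpy as np


class OnlineGradientRegression:
    """Toy EMA-weighted linear regression of gradients on positions (hypothetical helper)."""

    def __init__(self, dim, beta=0.9):
        self.beta = beta                  # EMA decay; 0.9 is an assumed value
        self.w = 0.0                      # decayed sum of weights
        self.s_x = np.zeros(dim)          # decayed sum of positions
        self.s_g = np.zeros(dim)          # decayed sum of gradients
        self.s_xx = np.zeros((dim, dim))  # decayed sum of x x^T
        self.s_gx = np.zeros((dim, dim))  # decayed sum of g x^T

    def update(self, x, g):
        # Older observations are down-weighted by beta at every step.
        b = self.beta
        self.w = b * self.w + 1.0
        self.s_x = b * self.s_x + x
        self.s_g = b * self.s_g + g
        self.s_xx = b * self.s_xx + np.outer(x, x)
        self.s_gx = b * self.s_gx + np.outer(g, x)

    def hessian(self):
        # Weighted means and (cross-)covariances from the running sums.
        m_x = self.s_x / self.w
        m_g = self.s_g / self.w
        c_xx = self.s_xx / self.w - np.outer(m_x, m_x)
        c_gx = self.s_gx / self.w - np.outer(m_g, m_x)
        # Regression slope g ~ H x + const; pinv guards against a
        # rank-deficient position covariance (an assumption of this sketch).
        h = c_gx @ np.linalg.pinv(c_xx)
        return 0.5 * (h + h.T)            # enforce symmetry


# Saddle-shaped quadratic f(x) = 0.5 x^T A x, whose gradient is A x.
A = np.array([[2.0, 0.5], [0.5, -1.0]])
rng = np.random.default_rng(0)
est = OnlineGradientRegression(dim=2)
for x in rng.normal(size=(50, 2)):        # random probes stand in for a trajectory
    est.update(x, A @ x)
H = est.hessian()
print(np.round(H, 6))
print(np.linalg.eigvalsh(H))              # one eigenvalue is negative
```

Because the estimate is a plain regression slope, nothing forces it to be positive definite, so negative curvature along a saddle direction shows up directly as a negative eigenvalue. A BFGS-style update, by contrast, is constructed to keep its approximation positive definite.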

Paper Structure

This paper contains 21 sections, 30 equations, 3 figures, 1 algorithm.

Figures (3)

  • Figure 1: Distribution of starting points for a 2D function. All points were randomly sampled from a uniform distribution within the specified bounds, ensuring consistency across all tests.
  • Figure 2: Performance comparison of BFGS and OGR with and without a line search procedure. The plots show the final loss distribution for 200 random starting points on various test functions.
  • Figure 3: Comparison of optimization for BFGS (top) and OGR (bottom) on selected test functions. OGR converges faster to the global minimum in all cases.
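The sampling protocol mentioned in the captions for Figures 1 and 2 (uniform starting points within fixed bounds, 200 per test function) could be reproduced roughly as follows. The helper name `sample_starting_points`, the bounds `[-5, 5]`, and the fixed seed are assumptions for illustration; the summary does not state the actual bounds or how consistency across tests was enforced.

```python
import numpy as np


def sample_starting_points(low, high, n_points=200, seed=0):
    """Draw n_points uniformly from the box [low, high] (hypothetical helper)."""
    low = np.asarray(low, dtype=float)
    high = np.asarray(high, dtype=float)
    # A fixed seed makes every optimizer see the same starting points.
    rng = np.random.default_rng(seed)
    return rng.uniform(low, high, size=(n_points, low.size))


# Assumed 2D bounds; each method would be run from the same 200 points.
starts = sample_starting_points(low=[-5.0, -5.0], high=[5.0, 5.0])
print(starts.shape)
```

Reusing one seed for all methods means differences in the final-loss distributions come from the optimizers rather than from the draw of initial points.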