Efficient Stochastic BFGS Methods Inspired by Bayesian Principles
André Carlon, Luis Espath, Raúl Tempone
TL;DR
The paper addresses the challenge of incorporating second-order information into stochastic optimization by deriving stochastic quasi-Newton methods from Bayesian principles. It formulates a probabilistic model over inverse Hessians using curvature pairs, which yields S-BFGS and its memory-efficient variant L-S-BFGS, both with closed-form updates and curvature rules that suppress noise amplification. A convergence analysis shows that preconditioned SGD with these updates makes provable progress under standard assumptions. Experiments on a high-dimensional quadratic problem and on large-scale logistic regression demonstrate robustness and scalability, with L-S-BFGS notably outperforming established baselines. The work offers a principled, computationally efficient way to exploit curvature information in stochastic settings and suggests integration with variance-reduction techniques as a future direction.
Abstract
Quasi-Newton methods are ubiquitous in deterministic local search due to their efficiency and low computational cost. This class of methods uses the history of gradient evaluations to approximate second-order derivatives. However, only noisy gradient observations are accessible in stochastic optimization; thus, deriving quasi-Newton methods in this setting is challenging. Whereas most existing quasi-Newton methods for stochastic optimization rely on deterministic equations that are modified to circumvent noise, we propose a new approach inspired by Bayesian inference to assimilate noisy gradient information and derive the stochastic counterparts to standard quasi-Newton methods. We focus on the derivations of stochastic BFGS and L-BFGS, but our methodology can also be employed to derive stochastic analogs of other quasi-Newton methods. The resulting stochastic BFGS (S-BFGS) and stochastic L-BFGS (L-S-BFGS) can effectively learn an inverse Hessian approximation even with small batch sizes. For a problem of dimension $d$, the iteration cost of S-BFGS is $\mathcal{O}(d^2)$, and the cost of L-S-BFGS is $\mathcal{O}(d)$. Numerical experiments on problems of dimension up to $30{,}720$ demonstrate the efficiency and robustness of the proposed methods.
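To illustrate how a history of gradient evaluations can stand in for second-order information at linear cost, the following is a minimal sketch of the classical deterministic L-BFGS two-loop recursion. It is not the L-S-BFGS update proposed in the paper: it assumes noise-free curvature pairs $(s_k, y_k)$ with $y_k^\top s_k > 0$, and the names `dot` and `lbfgs_direction` are hypothetical helpers introduced here for illustration only.

```python
def dot(u, v):
    """Inner product of two equal-length sequences of floats."""
    return sum(a * b for a, b in zip(u, v))


def lbfgs_direction(grad, pairs):
    """Apply the implicit inverse-Hessian approximation to `grad`.

    Classical (deterministic) L-BFGS two-loop recursion.
    `pairs` is a list of (s, y) tuples, oldest first, where
    s = x_{k+1} - x_k and y = g_{k+1} - g_k.
    Assumption: every pair satisfies dot(y, s) > 0.
    Cost is O(m * d) for m stored pairs in dimension d.
    """
    q = list(grad)
    coeffs = []  # (alpha, rho) per pair, stored newest first

    # First loop: traverse pairs from newest to oldest.
    for s, y in reversed(pairs):
        rho = 1.0 / dot(y, s)
        alpha = rho * dot(s, q)
        coeffs.append((alpha, rho))
        q = [qi - alpha * yi for qi, yi in zip(q, y)]

    # Initial scaling H0 = gamma * I, taken from the newest pair.
    if pairs:
        s, y = pairs[-1]
        gamma = dot(s, y) / dot(y, y)
    else:
        gamma = 1.0
    r = [gamma * qi for qi in q]

    # Second loop: traverse pairs from oldest to newest.
    for (s, y), (alpha, rho) in zip(pairs, reversed(coeffs)):
        beta = rho * dot(y, r)
        r = [ri + (alpha - beta) * si for ri, si in zip(r, s)]

    return r


# Quadratic with Hessian diag(1, 4): y = A s for each step s.
pairs = [([1.0, 0.0], [1.0, 0.0]), ([0.0, 1.0], [0.0, 4.0])]
direction = lbfgs_direction([2.0, 8.0], pairs)
print(direction)
```

In this toy case the two stored pairs span the whole space, so the recursion reproduces the exact inverse Hessian applied to the gradient. With noisy gradients the differences $y_k$ are themselves noisy, which is the difficulty the abstract describes and the reason the paper derives its updates differently.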
