Revisiting Scalable Hessian Diagonal Approximations for Applications in Reinforcement Learning
Mohamed Elsayed, Homayoon Farrahi, Felix Dangel, A. Rupam Mahmood
TL;DR
The paper revisits deterministic approximations of the Hessian diagonal as a cheap source of second-order information. It introduces HesScale, a refinement of the diagonal approximation of Becker and LeCun (1989, BL89) that computes the exact Hessian diagonal for the last layer and propagates diagonal estimates through the remaining layers at linear cost, along with HesScaleGN, a simpler Gauss-Newton variant. On small networks in supervised and reinforcement learning tasks, HesScale gives a higher-quality approximation than the alternatives considered, and the optimizers built on it (AdaHesScale and AdaHesScaleGN) optimize faster than existing methods. A step-size scaling mechanism based on the HesScale diagonal estimate further improves robustness and stability in RL. The findings suggest that scalable second-order methods built on HesScale can improve efficiency and reliability in RL and may extend to larger models in the future.
Abstract
Second-order information is valuable for many applications but challenging to compute. Several works focus on computing or approximating Hessian diagonals, but even this simplification introduces significant additional costs compared to computing a gradient. In the absence of efficient exact computation schemes for Hessian diagonals, we revisit an early approximation scheme proposed by Becker and LeCun (1989, BL89), which has a cost similar to that of computing gradients and appears to have been overlooked by the community. We introduce HesScale, an improvement over BL89 that adds negligible extra computation. On small networks, we find that this improved approximation is of higher quality than all alternatives, even those with theoretical guarantees such as unbiasedness, while being much cheaper to compute. We use this insight in reinforcement learning problems where small networks are used and demonstrate HesScale both in second-order optimization and in scaling the step-size parameter. In our experiments, HesScale optimizes faster than existing methods and improves stability through step-size scaling. These findings are promising for scaling second-order methods to larger models in the future.
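
As a small illustration of what an exact last-layer Hessian diagonal can look like, the sketch below assumes a softmax output trained with cross-entropy loss. This is an assumption made for the example, not a setting stated in the summary above. Under it, the second derivative of the loss with respect to logit i is p_i (1 - p_i), where p is the softmax output. The function names are hypothetical and the code is not the authors' implementation; it only checks the closed form against central finite differences.

```python
import math

def softmax(logits):
    # Subtract the maximum logit for numerical stability.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, target):
    # Negative log-probability of the target class.
    return -math.log(softmax(logits)[target])

def exact_hessian_diagonal(logits):
    # Closed form for softmax + cross-entropy: d^2 L / d z_i^2 = p_i * (1 - p_i).
    # The result does not depend on the target class.
    return [p * (1.0 - p) for p in softmax(logits)]

def finite_difference_diagonal(logits, target, eps=1e-3):
    # Central second difference along each logit coordinate, used as a check.
    base = cross_entropy(logits, target)
    diag = []
    for i in range(len(logits)):
        plus = list(logits)
        minus = list(logits)
        plus[i] += eps
        minus[i] -= eps
        second = (cross_entropy(plus, target) - 2.0 * base
                  + cross_entropy(minus, target)) / (eps * eps)
        diag.append(second)
    return diag

if __name__ == "__main__":
    z = [2.0, -1.0, 0.5]
    print(exact_hessian_diagonal(z))
    print(finite_difference_diagonal(z, target=0))
```

The closed form costs about as much as one softmax evaluation, which is in line with the claim that the diagonal can be obtained at a cost similar to that of a gradient. How diagonal estimates are propagated to earlier layers is not covered by this sketch.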
