AdaFisher: Adaptive Second Order Optimization via Fisher Information
Damien Martins Gomes, Yanlei Zhang, Eugene Belilovsky, Guy Wolf, Mahdi S. Hosseini
TL;DR
Second-order optimizers can offer faster convergence and better generalization than first-order methods, but their per-iteration cost often makes them impractical for large DNNs. The paper introduces AdaFisher, an adaptive optimizer that uses a diagonal block-Kronecker approximation of the Fisher Information Matrix (FIM) to precondition gradients, integrated into an Adam-like framework while avoiding costly square-root operations. Empirically, AdaFisher achieves state-of-the-art or near-SOTA results on image classification (including ImageNet-1k on a single GPU) and language modeling, with strong stability across hyper-parameters and scalable multi-GPU performance. The work provides convergence guarantees for convex and non-convex objectives and demonstrates that diagonal Kronecker factors capture essential curvature information, while using the empirical FIM (EFIM) for normalization layers enhances generalization. Overall, AdaFisher offers a practical path to leveraging second-order information in large-scale DNNs with controlled computational requirements.
Abstract
First-order optimization methods are currently the mainstream in training deep neural networks (DNNs). Optimizers like Adam incorporate limited curvature information by applying diagonal matrix preconditioning to the stochastic gradient during training. Despite the widespread use of first-order methods, second-order optimization algorithms exhibit superior convergence properties compared to first-order counterparts such as Adam and SGD. However, their practicality in training DNNs is still limited by their higher per-iteration computational cost. We present \emph{AdaFisher}, an adaptive second-order optimizer that leverages a \emph{diagonal block-Kronecker} approximation of the Fisher information matrix for adaptive gradient preconditioning. AdaFisher aims to bridge the gap between enhanced \emph{convergence/generalization} capabilities and computational efficiency in a second-order optimization framework for training DNNs. Despite the slow pace of second-order optimizers, we showcase that AdaFisher can be reliably adopted for image classification and language modeling, and that it stands out for its stability and robustness in hyper-parameter tuning. We demonstrate that AdaFisher \textbf{outperforms the SOTA optimizers} in terms of both accuracy and convergence speed. Code is available at https://github.com/AtlasAnalyticsLab/AdaFisher.
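As a rough illustration of the preconditioning idea summarized above, the following pure-Python sketch builds the diagonal of a Kronecker-factored Fisher estimate for a single linear layer and uses it in an Adam-like step that divides by the curvature estimate directly, with no square root. It is a simplified sketch and not the authors' implementation (see the repository linked above): the function names, the `state` layout, the hyper-parameter names `gamma` and `damping`, and all default values are assumptions made for illustration only.

```python
def diag_kronecker_fisher(activations, grad_outputs):
    """Diagonal of the Kronecker-factored Fisher estimate for one linear layer.

    activations:  batch of layer inputs a, each a list of length n_in
    grad_outputs: batch of loss gradients g w.r.t. the layer's
                  pre-activations, each a list of length n_out
    Returns an n_out x n_in nested list whose entry [i][j] is
    mean(g_i^2) * mean(a_j^2), i.e. the diagonal of diag(A) (x) diag(G).
    """
    batch = len(activations)
    n_in, n_out = len(activations[0]), len(grad_outputs[0])
    diag_a = [sum(a[j] ** 2 for a in activations) / batch for j in range(n_in)]
    diag_g = [sum(g[i] ** 2 for g in grad_outputs) / batch for i in range(n_out)]
    return [[diag_g[i] * diag_a[j] for j in range(n_in)] for i in range(n_out)]


def init_state(n_out, n_in):
    """Zero-initialized moving averages (hypothetical state layout)."""
    zeros = lambda: [[0.0] * n_in for _ in range(n_out)]
    return {"t": 0, "m": zeros(), "f": zeros()}


def preconditioned_step(weights, grads, fisher_diag, state,
                        lr=1e-3, beta1=0.9, gamma=0.8, damping=1e-3):
    """One Adam-like update, preconditioned by a diagonal Fisher estimate.

    Unlike Adam, the step divides by the damped curvature estimate itself
    rather than by its square root. Updates `weights` in place.
    """
    state["t"] += 1
    bias_correction = 1.0 - beta1 ** state["t"]
    for i, row in enumerate(weights):
        for j in range(len(row)):
            # Moving average of the gradient (first moment).
            state["m"][i][j] = beta1 * state["m"][i][j] + (1 - beta1) * grads[i][j]
            # Moving average of the diagonal Fisher estimate.
            state["f"][i][j] = gamma * state["f"][i][j] + (1 - gamma) * fisher_diag[i][j]
            m_hat = state["m"][i][j] / bias_correction
            row[j] -= lr * m_hat / (state["f"][i][j] + damping)
    return weights


# Toy usage: a layer with 2 inputs and 1 output, batch of 2 samples.
acts = [[1.0, 2.0], [3.0, 0.0]]
gouts = [[1.0], [-1.0]]
fisher = diag_kronecker_fisher(acts, gouts)
w = [[0.0, 0.0]]
state = init_state(1, 2)
preconditioned_step(w, [[1.0, 1.0]], fisher, state,
                    lr=0.1, gamma=0.5, damping=0.5)
```

In this toy example both weights receive the same gradient, but the weight attached to the input with larger average squared activation gets the smaller step, which is the qualitative effect of curvature-aware preconditioning.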
