AdaFisher: Adaptive Second Order Optimization via Fisher Information

Damien Martins Gomes, Yanlei Zhang, Eugene Belilovsky, Guy Wolf, Mahdi S. Hosseini

TL;DR

The paper tackles the bottleneck that second-order optimizers, while offering faster convergence and better generalization, are often impractical for large DNNs due to cost. It introduces AdaFisher, an adaptive optimizer that uses a diagonal block-Kronecker approximation of the Fisher Information Matrix (FIM) to precondition gradients, integrated into an Adam-like framework while avoiding costly square-root operations. Empirically, AdaFisher achieves state-of-the-art or near-SOTA results on image classification (including ImageNet-1k with a single GPU) and language modeling, with strong stability across hyper-parameters and scalable multi-GPU performance. The work provides convergence guarantees for convex and non-convex objectives and demonstrates that diagonal Kronecker factors capture essential curvature information, while EFIM for normalization layers enhances generalization. Overall, AdaFisher offers a practical path to leverage second-order information in large-scale DNNs with controlled computational requirements.

Abstract

First-order optimization methods are currently the mainstream in training deep neural networks (DNNs). Optimizers like Adam incorporate limited curvature information by applying diagonal matrix preconditioning to the stochastic gradient during training. Despite the widespread use of first-order methods, second-order optimization algorithms exhibit superior convergence properties compared to first-order counterparts such as Adam and SGD. However, their practicality in training DNNs is still limited by their higher per-iteration cost. We present AdaFisher, an adaptive second-order optimizer that leverages a diagonal block-Kronecker approximation of the Fisher information matrix for adaptive gradient preconditioning. AdaFisher aims to bridge the gap between enhanced convergence/generalization capabilities and computational efficiency in second-order optimization frameworks for training DNNs. Despite the slow pace of second-order optimizers, we show that AdaFisher can be reliably adopted for image classification and language modeling, and that it stands out for its stability and robustness in hyper-parameter tuning. We demonstrate that AdaFisher outperforms the SOTA optimizers in terms of both accuracy and convergence speed. Code is available from https://github.com/AtlasAnalyticsLab/AdaFisher.
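To make the idea of diagonal Kronecker-factored preconditioning concrete, the following is a minimal NumPy sketch for a single linear layer. It is not the authors' implementation: the function names, the exponential-moving-average coefficient `gamma`, the `damping` term, and the bias correction are all assumptions chosen for illustration. It only shows how the diagonals of the two Kronecker factors combine into a per-weight curvature estimate, and how an Adam-like step can divide by that estimate directly rather than by a square root.

```python
import numpy as np


def diag_kronecker_fisher(acts, grads):
    """Diagonal Kronecker-factored Fisher estimate for one linear layer.

    acts:  (batch, d_in)  inputs to the layer
    grads: (batch, d_out) gradients w.r.t. the layer's pre-activations
    Returns an array of shape (d_out, d_in), one entry per weight.
    """
    a_diag = np.mean(acts ** 2, axis=0)   # diagonal of E[a a^T]
    g_diag = np.mean(grads ** 2, axis=0)  # diagonal of E[g g^T]
    # diag(G (x) A) reshaped to the weight shape is an outer product.
    return np.outer(g_diag, a_diag)


def preconditioned_step(w, grad_w, m, fisher_ema, fisher_now, step,
                        lr=1e-3, beta1=0.9, gamma=0.9, damping=1e-3):
    """One Adam-like update that divides by the Fisher estimate (no sqrt).

    All hyper-parameters here are illustrative assumptions.
    """
    m = beta1 * m + (1.0 - beta1) * grad_w                     # first moment
    fisher_ema = gamma * fisher_ema + (1.0 - gamma) * fisher_now
    m_hat = m / (1.0 - beta1 ** step)                          # bias correction
    w = w - lr * m_hat / (fisher_ema + damping)                # no square root
    return w, m, fisher_ema


# Usage on random data for a layer with 3 inputs and 2 outputs.
rng = np.random.default_rng(0)
acts = rng.normal(size=(8, 3))
grads = rng.normal(size=(8, 2))
w = rng.normal(size=(2, 3))
grad_w = grads.T @ acts / acts.shape[0]  # weight gradient of a linear layer
fisher = diag_kronecker_fisher(acts, grads)
w, m, fisher_ema = preconditioned_step(
    w, grad_w, np.zeros_like(w), np.zeros_like(w), fisher, step=1
)
```

Storing only the two diagonals costs memory proportional to `d_in + d_out` per layer, which is what makes this kind of approximation cheap compared with full Kronecker factors.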

Paper Structure (38 sections, 9 theorems, 40 equations, 25 figures, 17 tables, 1 algorithm)

This paper contains 38 sections, 9 theorems, 40 equations, 25 figures, 17 tables, 1 algorithm.

Key Result

Proposition 3.1

Let $(\nu_i, \beta_i) \in \mathbb{R}^{C_i}$ be the scale and shift parameters of a normalization layer $i$. The empirical KFs for the FIM approximation are where $h_{i-1}, s_i \in \mathbb{R}^{C_i \times |\mathcal{T}_i|}$ represent the pre-normalized activations and gradients, respectively. Here, $\mathcal{T}_i$ is the set of dimensions over which normalization statistics are computed, and $C_i$ i

Figures (25)

  • Figure 1: Visualizing optimization trajectories for various optimizers overlaid on a loss landscape.
  • Figure 2: Illustration of EFIM computation using K-FAC for a given layer $i$.
  • Figure 3: Gershgorin disks and eigenvalue perturbations from the 37th convolutional layer of ResNet-18 at steps 5200 (middle of training) and 9800 (end of training). Left: Gershgorin circles; Right: eigenvalue spectrum with and without noise.
  • Figure 3: Validation of ImageNet-1k / ResNet50 by different optimizers reported on Top-1 and Top-5 accuracy.
  • Figure 4: Comparison of FIM diagonal histograms during ResNet18 training on CIFAR10: The figure displays the FIM diagonal elements for the first convolutional layer with Adam and AdaFisher over 1,000 training iterations.
  • ...and 20 more figures

Theorems & Definitions (13)

  • Proposition 3.1: EFIM for normalization layer
  • Proposition 3.2: Efficient EFIM
  • Proposition 3.3: Convergence in convex optimization
  • Proposition 3.4: Convergence in non-convex stochastic optimization
  • Theorem A.1: Gershgorin Circle Theorem
  • Proposition A.1: FIM for normalization layer
  • proof
  • Proposition A.2: Efficient FIM
  • proof
  • Proposition A.3: Convergence in convex optimization
  • ...and 3 more
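
The Gershgorin Circle Theorem listed above states that every eigenvalue of a square matrix lies in at least one disk centered at a diagonal entry, with radius equal to the sum of the absolute off-diagonal entries in that row. The sketch below checks this numerically on a small hand-picked matrix. The helper names are hypothetical and the matrix is illustrative, not data from the paper. Small radii relative to the centers mean the matrix is close to diagonal, which is the sort of evidence that would support keeping only the diagonal of a Kronecker factor.

```python
import numpy as np


def gershgorin_disks(mat):
    """Return (centers, radii) of the row-wise Gershgorin disks of `mat`."""
    centers = np.diag(mat)
    radii = np.sum(np.abs(mat), axis=1) - np.abs(centers)
    return centers, radii


def eigenvalues_in_disks(mat, tol=1e-9):
    """Check that each eigenvalue falls inside at least one disk."""
    centers, radii = gershgorin_disks(mat)
    eigs = np.linalg.eigvals(mat)
    # dist[k, i] = distance from eigenvalue k to the center of disk i
    dist = np.abs(eigs[:, None] - centers[None, :])
    return bool(np.all(np.any(dist <= radii[None, :] + tol, axis=1)))


# Illustrative, diagonally dominant symmetric matrix.
example = np.array([[4.0, 1.0, 0.0],
                    [1.0, 3.0, 0.5],
                    [0.0, 0.5, 2.0]])
centers, radii = gershgorin_disks(example)
inside = eigenvalues_in_disks(example)
```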