Inverse-Free Fast Natural Gradient Descent Method for Deep Learning

Xinwei Ou; Ce Zhu; Xiaolin Huang; Yipeng Liu

TL;DR

Second-order optimization in deep learning can converge faster than first-order methods, but it is held back by costly inversions of curvature matrices. The paper proposes Fast Natural Gradient Descent (FNGD), which uses the Sherman–Morrison–Woodbury formula to rewrite NGD preconditioning as a weighted sum of per-sample gradients, and then shares the weighting coefficients across epochs so that inversion is needed only during the first epoch. This brings the complexity close to that of first-order methods while preserving second-order benefits, aided by an efficient per-sample gradient computation, layer-wise coefficient sharing, and a damping strategy. Experiments on image classification and machine translation show FNGD achieving substantial speedups and competitive or superior accuracy and BLEU scores compared with state-of-the-art second-order methods and AdamW.
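
The reformulation can be checked numerically. The following is a minimal NumPy sketch, not the authors' implementation: it assumes a damped empirical Fisher matrix built from a matrix `U` whose columns are per-sample gradients, and the function names, the sizes, and the damping value `lam` are illustrative choices. It compares the direct preconditioned direction, which solves a system of parameter dimension, with the weighted-sum form, which solves a system of batch dimension only.

```python
import numpy as np


def ngd_direction_direct(U, lam):
    """Damped NGD direction (F + lam*I)^{-1} g with F = U U^T / N and g = mean of columns."""
    P, N = U.shape
    fisher = U @ U.T / N               # P x P empirical Fisher (assumed form)
    grad = U.mean(axis=1)              # averaged gradient, as in first-order methods
    return np.linalg.solve(fisher + lam * np.eye(P), grad)


def ngd_direction_weighted_sum(U, lam):
    """Same direction written as U @ c, where c solves an N x N system.

    Follows from the identity (U U^T / N + lam*I)^{-1} U = U (U^T U / N + lam*I)^{-1}.
    """
    N = U.shape[1]
    coeffs = np.linalg.solve(U.T @ U + N * lam * np.eye(N), np.ones(N))
    return U @ coeffs, coeffs


rng = np.random.default_rng(0)
U = rng.standard_normal((200, 16))     # 200 parameters, batch of 16 samples (toy sizes)
direct = ngd_direction_direct(U, lam=0.1)
weighted, coeffs = ngd_direction_weighted_sum(U, lam=0.1)
print(np.allclose(direct, weighted))
```

The second form never builds the 200 x 200 matrix, which is why the cost depends on the batch size rather than on the number of parameters.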

Abstract

Second-order optimization techniques have the potential to achieve faster convergence rates than first-order methods by incorporating second-order derivatives or statistics. However, their use in deep learning is limited by their computational cost. Various approaches have been proposed to address this issue, primarily by reducing the size of the matrix to be inverted. Nevertheless, the inverse still has to be computed repeatedly during training. In this work, we present a fast natural gradient descent (FNGD) method that only requires inversion during the first epoch. Specifically, we show that the natural gradient descent (NGD) update is essentially a weighted sum of per-sample gradients. We further propose to share these weighting coefficients across epochs, which does not affect empirical performance. Consequently, the FNGD update resembles the averaged sum of gradients used in first-order methods, and its computational complexity is comparable to theirs. Extensive experiments on image classification and machine translation tasks demonstrate the efficiency of the proposed FNGD. For training ResNet-18 on CIFAR-100, FNGD achieves a speedup of 2.07$\times$ compared with KFAC. For training Transformer on Multi30K, FNGD outperforms AdamW by 24 BLEU score while requiring almost the same training time.
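
Coefficient sharing can be illustrated with a small sketch. The abstract does not say how the shared coefficients are formed, so this example makes a loud assumption: the coefficients solved for each batch in the first epoch are averaged, and that average is reused afterwards. The class `SharedCoefficientPreconditioner`, the helper `smw_coefficients`, and the random stand-in gradients are hypothetical and serve only to show where the linear solves happen.

```python
import numpy as np


def smw_coefficients(U, lam):
    """Per-sample weights c solving (U^T U + N*lam*I) c = 1 for an assumed damped Fisher."""
    N = U.shape[1]
    return np.linalg.solve(U.T @ U + N * lam * np.eye(N), np.ones(N))


class SharedCoefficientPreconditioner:
    """Hypothetical helper: solve during epoch 0, reuse the averaged weights later."""

    def __init__(self, lam):
        self.lam = lam
        self.history = []      # coefficients collected during the first epoch
        self.shared = None     # fixed coefficients used from epoch 1 onwards
        self.num_solves = 0    # counts the linear solves actually performed

    def direction(self, U, epoch):
        if epoch == 0:
            c = smw_coefficients(U, self.lam)
            self.history.append(c)
            self.num_solves += 1
            return U @ c
        if self.shared is None:
            if not self.history:
                raise ValueError("no coefficients were collected in epoch 0")
            # Assumption: share the mean of the first-epoch coefficients.
            self.shared = np.mean(self.history, axis=0)
        return U @ self.shared  # fixed-coefficient weighted sum, no solve


rng = np.random.default_rng(0)
pre = SharedCoefficientPreconditioner(lam=0.1)
for epoch in range(3):
    for _ in range(4):                      # 4 batches per epoch (toy setting)
        U = rng.standard_normal((20, 8))    # stand-in per-sample gradients, one column per sample
        d = pre.direction(U, epoch)
print(pre.num_solves)
```

After the first epoch the update costs one matrix-vector product per batch, which is the same order of work as averaging the gradients.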

Paper Structure (20 sections, 4 theorems, 36 equations, 9 figures, 5 tables, 1 algorithm)

Introduction
Related Work
Preliminaries
Notation
Second-order Method
Natural Gradient Method
Proposed Method
SMW-based NGD
Coefficient-Sharing
Per-sample Gradient
Setting of Damping
Convergence
Experiments
Image Classification
Machine Translation
...and 5 more sections

Key Result

Theorem 1

Let Assumptions 1 and 2 hold. Suppose we optimize with FNGD using a damping value $\lambda = \frac{\lambda_{\min}}{M}$ and a sufficiently small learning rate $\eta \leq \tilde{\eta}$. Then $\|\mathbf{v}_k - y\|_2^2 \leq (1-\eta)^k \|\mathbf{v}_0 - y\|_2^2$.
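
To make the rate concrete, the bound can be turned into an iteration count. This is a short worked consequence rather than a statement from the paper: it assumes $0 < \eta < 1$ so that the logarithm is defined, and $\epsilon \in (0, 1)$ is a target reduction factor introduced here only for illustration.

$$
(1-\eta)^k \leq \epsilon
\;\Longleftrightarrow\;
k \geq \frac{\ln(1/\epsilon)}{-\ln(1-\eta)},
\qquad\text{and since } -\ln(1-\eta) \geq \eta, \text{ the condition }
k \geq \frac{\ln(1/\epsilon)}{\eta} \text{ is sufficient.}
$$

For example, with $\eta = 0.1$ and $\epsilon = 10^{-3}$, the exact condition gives $k \geq 66$, while the simpler sufficient condition gives $k \geq 70$.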

Figures (9)

Figure 1: Illustration of FNGD. The gradient preconditioning formula in NGD can be rewritten as a weighted sum of per-sample gradients. By sharing the weighting coefficients across epochs, the proposed FNGD approximates the preconditioning step as a fixed-coefficient weighted sum. This reduces the computational complexity of FNGD to that of SGD.
Figure 2: Several existing types of FIM approximation. The green block represents feed-forward statistics, while the blue block represents back-propagation statistics.
Figure 3: The normalized correlation matrix $\mathbf{U}_l^\text{T}\mathbf{U}_l$ for four layers in ResNet-32 (He et al., 2016) on CIFAR-10 with batch size 128. The first two layers are adjacent, as are the last two layers.
Figure 4: Performance comparison between FNGD and NGD for training ResNet-32 on CIFAR-10. We refer to the method with coefficient-sharing as FNGD.
Figure 5: The optimization curves of FNGD, SGD-m, KFAC, Shampoo, and Eva on ResNet-32 and VGG-11 with the CIFAR-10 dataset.
...and 4 more figures

Theorems & Definitions (7)

Theorem 1
Lemma 1
proof
Lemma 2
proof
Lemma 3
proof
