Inverse-Free Fast Natural Gradient Descent Method for Deep Learning
Xinwei Ou, Ce Zhu, Xiaolin Huang, Yipeng Liu
TL;DR
Second-order optimization in deep learning offers faster convergence but is hindered by costly inversions of curvature matrices. The paper proposes Fast Natural Gradient Descent (FNGD), which uses the Sherman–Morrison–Woodbury formula to rewrite the NGD-preconditioned gradient as a weighted sum of per-sample gradients. The weighting coefficients are computed during the first epoch and then shared across subsequent epochs, so no further inversions are needed. This brings the computational complexity close to that of first-order methods while preserving second-order benefits, aided by an efficient per-sample gradient computation, layer-wise coefficient sharing, and a damping strategy. On image classification and machine translation, FNGD achieves substantial speedups and competitive or superior accuracy and BLEU scores compared with state-of-the-art second-order methods and AdamW.
Abstract
Second-order optimization techniques can converge faster than first-order methods by incorporating second-order derivatives or statistics. However, their use in deep learning is limited by their computational cost. Various approaches have been proposed to address this issue, primarily by reducing the size of the matrix to be inverted. Nevertheless, the inversion must still be performed iteratively. In this work, we present a fast natural gradient descent (FNGD) method that requires inversion only during the first epoch. Specifically, we show that the natural gradient descent (NGD) update is essentially a weighted sum of per-sample gradients. We further propose sharing these weighting coefficients across epochs, which does not affect empirical performance. Consequently, the FNGD update resembles the averaged sum of gradients used in first-order methods, so its computational complexity is comparable to theirs. Extensive experiments on image classification and machine translation tasks demonstrate the efficiency of the proposed FNGD. For training ResNet-18 on CIFAR-100, FNGD achieves a speedup of 2.07$\times$ over KFAC. For training Transformer on Multi30K, FNGD outperforms AdamW by 24 BLEU points while requiring almost the same training time.
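The claim that the NGD update is a weighted sum of per-sample gradients can be checked numerically on a toy problem. The sketch below is not the authors' implementation. It assumes an empirical Fisher matrix F = (1/N) U Uᵀ built from N per-sample gradients (the columns of U) and a scalar damping term λ; the function names and the random test data are hypothetical. Under these assumptions, the Sherman–Morrison–Woodbury identity gives (F + λI)⁻¹ g = U c with c = (UᵀU + NλI)⁻¹ 1, so only an N×N system has to be solved instead of a P×P one.

```python
import numpy as np


def ngd_coefficients(U, damping):
    """Return weights c such that U @ c equals the damped NGD direction.

    U has shape (P, N): one per-sample gradient per column.
    Only an N x N linear system is solved (assumed setup, see lead-in).
    """
    n = U.shape[1]
    gram = U.T @ U  # N x N Gram matrix of per-sample gradients
    return np.linalg.solve(gram + n * damping * np.eye(n), np.ones(n))


def direct_ngd(U, damping):
    """Reference: solve the full P x P damped Fisher system directly."""
    p, n = U.shape
    fisher = U @ U.T / n          # empirical Fisher, P x P
    mean_grad = U.mean(axis=1)    # ordinary mini-batch gradient
    return np.linalg.solve(fisher + damping * np.eye(p), mean_grad)


# Hypothetical toy data: 50 parameters, 8 samples.
rng = np.random.default_rng(0)
U = rng.standard_normal((50, 8))
damping = 0.1

c = ngd_coefficients(U, damping)
weighted_sum = U @ c                  # NGD direction as a weighted sum
reference = direct_ngd(U, damping)    # NGD direction via full inversion
max_abs_diff = np.max(np.abs(weighted_sum - reference))
```

In this toy setting `max_abs_diff` should sit at floating-point round-off level. A first-order method corresponds to replacing `c` with the uniform weights 1/N, which is why reusing a fixed `c` across epochs, as the abstract describes, would leave only a weighted sum to compute at each step.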
