An Improved Empirical Fisher Approximation for Natural Gradient Descent

Xiaodong Wu, Wenyi Yu, Chao Zhang, Philip Woodland

TL;DR

This paper investigates the inversely-scaled projection issue of EF, which is shown to be a major cause of its poor empirical approximation quality, and proposes an improved empirical Fisher (iEF) method. iEF is motivated as a generalised NGD method from a loss reduction perspective and retains the practical convenience of EF.

Abstract

Approximate Natural Gradient Descent (NGD) methods are an important family of optimisers for deep learning models, which use approximate Fisher information matrices to pre-condition gradients during training. The empirical Fisher (EF) method approximates the Fisher information matrix empirically by reusing the per-sample gradients collected during back-propagation. Despite its ease of implementation, the EF approximation has theoretical and practical limitations. This paper investigates the inversely-scaled projection issue of EF, which is shown to be a major cause of its poor empirical approximation quality. An improved empirical Fisher (iEF) method is proposed to address this issue. It is motivated as a generalised NGD method from a loss reduction perspective, while retaining the practical convenience of EF. The exact iEF and EF methods are experimentally evaluated using practical deep learning setups. Optimisation experiments show that applying exact iEF directly as an optimiser provides strong convergence and generalisation. Additionally, under a novel empirical evaluation framework, the proposed iEF method approximates exact Natural Gradient updates consistently better than both the EF and the more expensive sampled Fisher methods, while also being robust to the choice of damping across tasks and training stages. Improving existing approximate NGD optimisers with iEF is expected to lead to better convergence and robustness. Furthermore, the iEF method also serves as a better approximation to the Fisher information matrix itself, which enables the improvement of a variety of Fisher-based methods beyond optimisation.
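
To make the EF construction concrete, the following is a minimal NumPy sketch, not the authors' implementation. It builds the EF matrix as the average outer product of per-sample gradients and uses it to pre-condition the mean gradient. The function name `ef_update`, the `damping` parameter and the toy gradient values are illustrative assumptions. The example also shows why unevenly sized per-sample gradients matter: for linearly independent per-sample gradients and zero damping, the EF update has the same inner product with every per-sample gradient, so its projection onto each gradient direction scales inversely with that gradient's norm.

```python
import numpy as np


def ef_update(per_sample_grads, damping=0.0):
    """Return the EF-preconditioned direction (F_EF + damping * I)^+ g.

    per_sample_grads: (N, P) array whose n-th row is the gradient of the
    n-th sample's loss. g is the mean gradient and F_EF is the mean outer
    product of the rows. The name and signature are hypothetical.
    """
    J = np.asarray(per_sample_grads, dtype=float)
    n, p = J.shape
    g = J.mean(axis=0)
    F = J.T @ J / n
    # The pseudo-inverse keeps the sketch well defined when F is singular.
    return np.linalg.pinv(F + damping * np.eye(p)) @ g


# Hypothetical per-sample gradients with very different norms (1 and 100).
J = np.array([[1.0, 0.0],
              [0.0, 100.0]])
delta = ef_update(J)

norms = np.linalg.norm(J, axis=1)
# Projection of the update onto each unit-length per-sample gradient.
proj = J @ delta / norms
```

A descent step would then be `theta - lr * delta` for some learning rate `lr`. In this toy case the sample with the 100 times larger gradient receives a 100 times smaller projection.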

Paper Structure

This paper contains 71 sections, 3 theorems, 61 equations, 15 figures, 8 tables, 4 algorithms.

Key Result

Theorem 5.3

Suppose Assumption \ref{ass:full_rank_covariance} holds. Then, $\forall n \in \{1,\dots, N\}$, the target probability $\hat{p}_n(t) := p_{\bm\theta(t)}(y=y_n|\bm x_n)$ for the $n$-th training sample is bounded as follows, where $C_0 = \frac{1}{1-\hat{p}_n(0)}+\log\frac{\hat{p}_n(0)}{1-\hat{p}_n(0)}$ and $t>\max\{-1-C_0, 0\}$.
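
As a worked example of the quantities named in the theorem, the sketch below evaluates the constant $C_0$ and the threshold $\max\{-1-C_0, 0\}$ above which $t$ must lie, given an initial target probability $\hat{p}_n(0)$. It only evaluates the two expressions stated above. The function names `c0` and `min_t` are illustrative assumptions, and the bound itself is not computed.

```python
import math


def c0(p0):
    """C_0 = 1 / (1 - p0) + log(p0 / (1 - p0)) for an initial probability p0."""
    if not 0.0 < p0 < 1.0:
        raise ValueError("p0 must lie strictly between 0 and 1")
    return 1.0 / (1.0 - p0) + math.log(p0 / (1.0 - p0))


def min_t(p0):
    """Threshold max{-1 - C_0, 0}; the theorem's condition is t > min_t(p0)."""
    return max(-1.0 - c0(p0), 0.0)


# With p0 = 0.5 the log term vanishes, so C_0 = 2 and the threshold is 0.
c0_half = c0(0.5)
t_half = min_t(0.5)

# With a low initial probability, C_0 drops below -1 and the threshold is positive.
t_low = min_t(0.1)
```

For $\hat{p}_n(0) = 0.5$ the condition reduces to $t > 0$, while a smaller initial probability such as $0.1$ pushes the threshold slightly above zero.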

Figures (15)

  • Figure 1: A visual comparison of Fisher, iEF and EF as pre-conditioners for a 2-parameter 2-datum linear least-squares regression problem inspired by EF-limitation (see Appendix \ref{sec:app-toy-lls-2pt} for details). All three plots are loss landscapes with the $x$-axis and $y$-axis representing $\theta_0$ and $\theta_1$ respectively. The first plot shows the gradient vector field of the loss function and 5 sampled training trajectories for SGD updates. Similarly, the second plot is for NGD/iEF updates and the third plot is for EF updates (with a zoomed view). The global minimum (0, 0) is marked with a star where visible. The two dashed lines on all plots represent the optimal parameter sets for each training sample. It can be seen that the EF method has a highly distorted update vector field while the iEF and NGD methods adapt to the curvature of the problem successfully.
  • Figure 2: Four (log-scaled) ratios computed for checkpoints at various stages of training (sampled at intervals of one epoch) for 3 of the 15 tasks. The $x$-axes represent the training stages of the model. $0\%$ means the initialised model and $100\%$ means the model at the end of the last epoch. Each data point is averaged across 100 evaluations, and the error bars represent the standard deviation (1-sigma). The first plot shows $\gamma_\text{EF} / \gamma_\text{SGD}$, which denotes the relative approximation quality improvement of EF updates w.r.t. SGD updates (the lower the better). The second plot shows $\gamma_\text{iEF} / \gamma_\text{SGD}$, and the third plot shows $\gamma_\text{SF} / \gamma_\text{EF}$. The last plot depicts the imbalance of gradient norms, which is the average ratio between the maximum and minimum gradient norm for each evaluated batch (a larger value indicates more imbalanced per-sample gradient norms, which should lead to a more significant inversely-scaled projection issue). Overall, the approximation quality follows iEF $>$ SF $>$ EF.
  • Figure 3: Approximation quality (relative to SGD) of EF, SF and iEF methods w.r.t. damping factor $\lambda$ at different training stages of task CoLA+T5+LoRA. The $x$-axes show the value of the damping factor, and the $y$-axes depict the relative approximation quality improvement of the target update method w.r.t. SGD (the lower the better). Each data point is averaged across 100 evaluations, and the error bars represent the standard deviation (1-sigma). The first plot is for the checkpoint saved at the end of the first training epoch, the second plot for the mid-way epoch and the third plot for the final epoch. It can be observed that iEF achieves the best approximation quality robustly for any near-zero $\lambda$. In contrast, $\lambda$ has a non-linear impact on both SF and EF. When optimally tuned, an EF update can achieve better approximation quality than SGD, and an SF update can achieve comparable quality to iEF. However, the optimal damping factor for EF and SF changes greatly with training stages (and tasks).
  • Figure 4: A visual comparison of Fisher, iEF and EF as pre-conditioners for a logistic regression problem (classifying two 1D data points $x_0 = 0, x_1 = 2$ into two classes). The target model is $p_{\bm\theta}(x_n) = \sigma(\theta_0 + \theta_1 x_n)$, where $\sigma(\cdot)$ is the Sigmoid function and CE loss is used, which follows the problem setup description in Sec. \ref{sec:sl-softmax-setup}. This figure differs from Fig. \ref{fig:visual_tour} in 3 aspects: 1) iEF and NG updates are no longer identical, and are presented in separate plots; 2) there is no global minimum, but the model achieves lower loss when moving further towards the bottom-right corner; 3) the dashed line now represents the optimal parameter set for a decision boundary of $x=1$. The training trajectory of EF is still ill-behaved, while both NG and iEF updates move towards the optimal decision boundary smoothly.
  • Figure 5: Approximation quality (relative to SGD) of "un-damped" iEF, ieKFAC, KFAC and eKFAC for 3 selected PEFT tasks (QNLI+LoRA, RTE+LoRA, MRPC+LoRA) across training stages. The style of the visualisation follows that of the first 3 plots of Fig. \ref{fig:sample_gamma}. This evaluation shows that the ieKFAC update has a similar approximation quality to the exact iEF method, and a much better approximation quality than both KFAC and eKFAC in most training stages. This demonstrates the effectiveness of using ieKFAC to approximate iEF and its potential to further improve the approximation quality of existing KFAC-based methods.
  • ...and 10 more figures
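
The gradient-norm imbalance plotted in Figure 2 can be illustrated on the logistic regression model of Figure 4. The sketch below is a minimal NumPy example under stated assumptions: the labels `y = [0, 1]` and the evaluation point `theta = (0, 0)` are chosen for illustration and are not taken from the paper, and the helper names are hypothetical. It computes the per-sample CE gradients for $p_{\bm\theta}(x_n) = \sigma(\theta_0 + \theta_1 x_n)$ and the ratio between the largest and smallest per-sample gradient norm in the batch.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def per_sample_grads(theta, x, y):
    """Per-sample CE gradients for p_theta(x_n) = sigmoid(theta_0 + theta_1 * x_n).

    Returns an (N, 2) array; row n is the gradient of sample n's loss
    w.r.t. (theta_0, theta_1).
    """
    z = theta[0] + theta[1] * x
    residual = sigmoid(z) - y                      # dL_n / dz_n for binary CE
    features = np.stack([np.ones_like(x), x], axis=1)
    return residual[:, None] * features


def grad_norm_imbalance(grads):
    """Ratio of the maximum to the minimum per-sample gradient norm in a batch."""
    norms = np.linalg.norm(grads, axis=1)
    return norms.max() / norms.min()


x = np.array([0.0, 2.0])       # the two 1D data points from Figure 4
y = np.array([0.0, 1.0])       # assumed labels, not stated in the caption
theta = np.array([0.0, 0.0])   # assumed evaluation point

grads = per_sample_grads(theta, x, y)
ratio = grad_norm_imbalance(grads)
```

Averaging `grad_norm_imbalance` over many batches would give a statistic of the kind described in the Figure 2 caption. A ratio far above 1 indicates strongly imbalanced per-sample gradient norms.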

Theorems & Definitions (3)

  • Theorem 5.3
  • Theorem 5.4
  • Lemma C.1