Trade-Offs of Diagonal Fisher Information Matrix Estimators

Alexander Soen; Ke Sun

TL;DR

The variances of diagonal Fisher information estimators depend on the non-linearity with respect to different parameter groups and should not be neglected when estimating the Fisher information.

Abstract

The Fisher information matrix characterizes the local geometry of the parameter space of neural networks. It underpins theories and tools for understanding and optimizing neural networks. Because it is expensive to compute, practitioners often use random estimators and evaluate only the diagonal entries. We examine two popular estimators whose accuracy and sample complexity depend on their associated variances. We derive bounds on the variances and instantiate them in neural networks for regression and classification. We navigate the trade-offs of both estimators based on analytical and numerical studies. We find that the variance quantities depend on the non-linearity with respect to different parameter groups and should not be neglected when estimating the Fisher information.
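As a rough illustration of what "two estimators" can mean here, the sketch below assumes they are the averaged squared score and the averaged negative Hessian diagonal of the log-likelihood, with labels sampled from the model itself. The abstract does not spell out these definitions, so treat them as an assumption. The model (logistic regression at a single input) and the names `sigmoid` and `diag_fim_estimators` are hypothetical choices made only for this example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def diag_fim_estimators(theta, x, n_samples, rng):
    """Diagonal FIM estimates for logistic regression at one input x.

    Assumed model: p(y = 1 | x; theta) = sigmoid(theta @ x).
    Returns (i1_hat, i2_hat, exact), each of shape (dim(theta),).
    """
    p = sigmoid(theta @ x)
    # Labels are drawn from the model, not taken from a dataset.
    y = rng.binomial(1, p, size=n_samples)

    # Assumed estimator 1: average of squared score entries.
    # Score of the log-likelihood w.r.t. theta is (y - p) * x.
    score = (y - p)[:, None] * x[None, :]
    i1_hat = np.mean(score ** 2, axis=0)

    # Assumed estimator 2: average of the negative Hessian diagonal.
    # For this model the Hessian is -p (1 - p) x x^T and does not depend
    # on y, so every sample contributes the same value.
    neg_hess_diag = np.tile(p * (1.0 - p) * x ** 2, (n_samples, 1))
    i2_hat = np.mean(neg_hess_diag, axis=0)

    # Closed-form diagonal Fisher information for comparison.
    exact = p * (1.0 - p) * x ** 2
    return i1_hat, i2_hat, exact


rng = np.random.default_rng(0)
theta = np.array([0.5, -0.25])
x = np.array([1.0, -2.0])
i1_hat, i2_hat, exact = diag_fim_estimators(theta, x, 100_000, rng)
print("squared-score estimate:", i1_hat)
print("negative-Hessian estimate:", i2_hat)
print("closed form:", exact)
```

In this toy model the Hessian-based estimate has zero sampling variance because the Hessian is independent of the sampled label, while the squared-score estimate fluctuates from sample to sample. For parameters deeper in a network the Hessian generally does depend on the sampled label, so neither estimator is noise-free in general.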

Paper Structure (43 sections, 20 theorems, 103 equations, 13 figures, 1 table)

Settings
Related Work
Variance of Diagonal FIM Estimators
Practical Variance Estimation
Joint FIM Estimators
Case Studies
Regression: Isotropic Gaussian Distribution
Classification: Categorical Distribution
Empirical Verification: Classification
Relationship with the "Empirical Fisher"
Conclusion
Natural Gradient Toy Data Example
Data
Model
Training
...and 28 more sections

Key Result

Lemma 3.1

$\forall\bm x \in \Re^{I}$, $\forall{i}=1,\ldots,\dim(\bm\theta)$,

Figures (13)

Figure 1: Natural gradient (NG) descent using $\hat{\mathcal{I}}_{1}(\bm \theta)$ / $\hat{\mathcal{I}}_{2}(\bm \theta)$ on a 2D toy dataset for regression (linear regression) and classification (logistic regression) (details in \ref{sec:teaser}). The inset plot shows the parameter updates throughout training. Here, the variance of $\hat{\mathcal{I}}_{2}(\bm \theta)$ is generally lower than that of $\hat{\mathcal{I}}_{1}(\bm \theta)$.
Figure 2: MNIST for a 4-layer MLP with sigmoid activations. Top: The estimated Fisher information (FI), variances, and variance bounds across 4 parameter groups and 20 training epochs. The FI (green line) is estimated using $\hat{\mathcal{I}}_{1}$ ($\hat{\mathcal{I}}_{2}$ is almost identical and not shown for clarity). For the variances and their bounds, the standard deviation (square root of the variance) is shown. Bottom: The log-ratio of the upper bounds (UBs) from \ref{thm:efim_conditional_bounds} and the true variances. The closer to 0, the tighter the UB. In the rightmost column, the variance of $\hat{\mathcal{I}}_{2}$ vanishes: ${\mathcal{V}}_{2}(\theta_i \,\vert\, \bm {x}) = 0 \le {\mathcal{V}}_{1}(\theta_i \,\vert\, \bm {x})$. Thus the related curves of $\hat{\mathcal{I}}_{2}$ are not shown.
Figure I: Extended version of \ref{fig:teaser} with the sum of variances of the FIM estimators over epochs.
Figure II: \ref{fig:teaser_var} over different randomizations (a).
Figure III: \ref{fig:teaser_var} over different randomizations (b).
...and 8 more figures

Theorems & Definitions (46)

Lemma 3.1
Theorem 4.1
Corollary 4.2
Proposition 4.3
Remark 4.4
Remark 4.5
Corollary 4.6
Theorem 4.7
Lemma 4.8
Proposition 5.1
...and 36 more
