NGD converges to less degenerate solutions than SGD

Moosa Saghir; N. R. Raghavendra; Zihe Liu; Evan Ryan Gunter

TL;DR

Singular learning theory (SLT) proposes the learning coefficient, which describes the rate at which the volume of the region of parameter space around a local minimum grows with loss, and which thereby incorporates information from higher-order terms.

Abstract

The number of free parameters, or dimension, of a model is a straightforward way to measure its complexity: a model with more parameters can encode more information. However, this is not an accurate measure of complexity: models capable of memorizing their training data often generalize well despite their high dimension. Effective dimension aims to more directly capture the complexity of a model by counting only the number of parameters required to represent the functionality of the model. Singular learning theory (SLT) proposes the learning coefficient $\lambda$ as a more accurate measure of effective dimension. By describing the rate of increase of the volume of the region of parameter space around a local minimum with respect to loss, $\lambda$ incorporates information from higher-order terms. We compare $\lambda$ of models trained using natural gradient descent (NGD) and stochastic gradient descent (SGD), and find that those trained with NGD consistently have a higher effective dimension for both of our methods: the Hessian trace $\text{Tr}(\mathbf{H})$, and the estimate of the local learning coefficient (LLC) $\hat{\lambda}(w^*)$.
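The volume-scaling description of $\lambda$ can be made concrete with a toy sketch. It is not taken from the paper's experiments: the one-dimensional losses `quadratic` and `quartic`, the grid resolution, and the helper names are all assumptions chosen for illustration. The sketch measures how fast the volume of the region $\{w : L(w) < \epsilon\}$ grows with $\epsilon$ and compares that with the second derivative at the minimum, which in one dimension is the Hessian trace.

```python
import math

def volume_fraction(loss, eps, n_grid=200_001):
    """Fraction of a uniform grid on [-1, 1] where loss(w) < eps."""
    count = 0
    for i in range(n_grid):
        w = -1.0 + 2.0 * i / (n_grid - 1)
        if loss(w) < eps:
            count += 1
    return count / n_grid

def scaling_exponent(loss, eps_values):
    """Least-squares slope of log V(eps) against log eps."""
    xs = [math.log(e) for e in eps_values]
    ys = [math.log(volume_fraction(loss, e)) for e in eps_values]
    x_mean = sum(xs) / len(xs)
    y_mean = sum(ys) / len(ys)
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den

def second_derivative(loss, w=0.0, h=1e-3):
    """Central finite-difference estimate of L''(w)."""
    return (loss(w + h) - 2.0 * loss(w) + loss(w - h)) / h ** 2

# Hypothetical toy losses, both minimized at w = 0.
def quadratic(w):
    return w ** 2  # regular minimum

def quartic(w):
    return w ** 4  # degenerate minimum: the second derivative vanishes

EPS_VALUES = [1e-4, 3e-4, 1e-3, 3e-3, 1e-2]
lambda_quadratic = scaling_exponent(quadratic, EPS_VALUES)
lambda_quartic = scaling_exponent(quartic, EPS_VALUES)

print("quadratic:", lambda_quadratic, second_derivative(quadratic))
print("quartic:  ", lambda_quartic, second_derivative(quartic))
```

Under these assumptions the fitted exponent is close to 1/2 for the quadratic loss and close to 1/4 for the quartic loss. The second derivative at the minimum separates the two cases only by being nonzero or approximately zero, whereas the scaling exponent also reflects the quartic term, which is the sense in which $\lambda$ uses higher-order information.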

Paper Structure (22 sections, 33 equations, 8 figures)

This paper contains 22 sections, 33 equations, 8 figures.

Introduction
SLT Motivation
Natural Gradient Descent
Theoretical framework of SLT
Background
The LLC $\lambda(w^*)$
Generalization
Methodology
Summary
Estimating the local learning coefficient $\hat{\lambda}(w^*)$ and the WBIC
Experiments and results
NGD solutions have higher $\hat{\lambda}$ and $\text{Tr}(\mathbf{H})$ than SGD
Reducing Fisher matrix smoothing increases the LLC of NGD
NGD escaping a basin with highly degenerate singularities
Conclusion
...and 7 more sections

Figures (8)

Figure 1: NGD solutions have higher $\hat{\lambda}$, with highest t-value of $1.9^{-31}$, over a range of model sizes. Hyperparameters used: $\alpha = 10^{-2}$, $\text{learning rate} = 10^{-2}$, $\epsilon = 10^{-10}$, $\text{batch size} = 128$.
Figure 2: Hyperparameters used: $\alpha = 10^{-1}$, $\text{lr} = 10^{-2}$, $\epsilon = 10^{-10}$, $\text{batch size} = 128$.
Figure 3:
Figure 4: LLC and validation loss after splitting, using Fashion-MNIST dataset. $\text{SGD lr} = 10^{-2}$, $\text{NGD lr} = 10^{-2}$
Figure 5: LLC and validation loss after splitting, using MNIST dataset.
...and 3 more figures
