NGD converges to less degenerate solutions than SGD

Moosa Saghir, N. R. Raghavendra, Zihe Liu, Evan Ryan Gunter

TL;DR

Singular learning theory (SLT) proposes the learning coefficient, which describes the rate of increase of the volume of the region of parameter space around a local minimum with respect to loss, and so incorporates information from higher-order terms.

Abstract

The number of free parameters, or dimension, of a model is a straightforward way to measure its complexity: a model with more parameters can encode more information. However, it is not an accurate measure: models capable of memorizing their training data often generalize well despite their high dimension. Effective dimension aims to capture the complexity of a model more directly by counting only the number of parameters required to represent the model's functionality. Singular learning theory (SLT) proposes the learning coefficient $\lambda$ as a more accurate measure of effective dimension. By describing the rate of increase of the volume of the region of parameter space around a local minimum with respect to loss, $\lambda$ incorporates information from higher-order terms. We compare $\lambda$ for models trained using natural gradient descent (NGD) and stochastic gradient descent (SGD), and find that those trained with NGD consistently have a higher effective dimension under both of our measures: the Hessian trace $\text{Tr}(\mathbf{H})$ and the estimate of the local learning coefficient (LLC) $\hat{\lambda}(w^*)$.
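
The volume-scaling description of $\lambda$ can be illustrated with a toy example that is not taken from the paper. The sketch below assumes two hypothetical two-parameter losses, a quadratic $w_1^2 + w_2^2$ and a quartic $w_1^4 + w_2^4$, and estimates by Monte Carlo how the volume of the region $\{w : L(w) < \epsilon\}$ grows with $\epsilon$. The log-log slope should be close to $1$ (that is, $d/2$) for the quadratic and close to $1/2$ for the quartic. The quartic has a zero Hessian at its minimum, so its degeneracy shows up only in higher-order terms. All function names here are illustrative, and this is not the LLC estimator used in the paper.

```python
import math
import random


def volume_fraction(loss, eps, samples):
    """Fraction of sampled points whose loss is below eps."""
    return sum(1 for w in samples if loss(w) < eps) / len(samples)


def fit_slope(xs, ys):
    """Ordinary least-squares slope of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


def volume_scaling_exponent(loss, epsilons, samples):
    """Slope of log V(eps) against log eps, where V is the sub-level-set volume."""
    log_eps = [math.log(e) for e in epsilons]
    log_vol = [math.log(volume_fraction(loss, e, samples)) for e in epsilons]
    return fit_slope(log_eps, log_vol)


# Toy losses (assumptions for illustration), both minimized at w = (0, 0).
def quadratic(w):
    return w[0] ** 2 + w[1] ** 2  # non-degenerate minimum


def quartic(w):
    return w[0] ** 4 + w[1] ** 4  # degenerate minimum: Hessian is zero at 0


rng = random.Random(0)  # fixed seed for reproducibility
samples = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(200_000)]
epsilons = [0.01, 0.02, 0.05, 0.1, 0.2]

exp_quadratic = volume_scaling_exponent(quadratic, epsilons, samples)
exp_quartic = volume_scaling_exponent(quartic, epsilons, samples)

print(f"quadratic exponent: {exp_quadratic:.2f}")
print(f"quartic exponent:   {exp_quartic:.2f}")
```

In this toy setting the smaller exponent marks the more degenerate minimum: the low-loss region around the quartic minimum shrinks more slowly as $\epsilon \to 0$, which a Hessian-based measure alone would not distinguish from a completely flat direction.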

Paper Structure (22 sections, 33 equations, 8 figures)

Figures (8)

  • Figure 1: NGD solutions have higher $\hat{\lambda}$, with highest t-value of $1.9^{-31}$, over a range of model sizes. Hyperparameters used: $\alpha = 10^{-2}$, $\text{learning rate} = 10^{-2}$, $\epsilon = 10^{-10}$, $\text{batch size} = 128$.
  • Figure 2: Hyperparameters used: $\alpha = 10^{-1}$, $\text{lr} = 10^{-2}$, $\epsilon = 10^{-10}$, $\text{batch size} = 128$.
  • Figure 3:
  • Figure 4: LLC and validation loss after splitting, using fashion-MNIST dataset. $\text{SGD lr} = 10^{-2},\text{NGD lr} = 10^{-2}$
  • Figure 5: LLC and validation loss after splitting, using MNIST dataset.
  • ...and 3 more figures