Approximate Gaussianity Beyond Initialisation in Neural Networks

Edward Hirst, Sanjaye Ramgoolam

TL;DR

This work studies how neural network weight matrices evolve during MNIST training, using permutation-invariant Gaussian matrix models (PIGMMs) with 13 parameters. By computing linear, quadratic, cubic, and quartic invariants and fitting PIGMMs to ensembles across initialisations and layers, the authors quantify approximate Gaussianity and track departures from it through training via deviation measures and the Wasserstein distance. They find that initialisation is well described by a simple Gaussian within the PIGMM class, whereas training induces correlations between matrix entries that only the more general PIGMMs capture; higher-order invariants remain well predicted by the fitted models. Architectural changes show that the framework is robust under regularisation but reaches its limits in very wide networks, motivating extensions that include higher-degree invariants and non-square/bipartite structures.
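As a rough illustration of what a permutation invariant of a weight matrix is, the sketch below computes the two linear invariants (the trace and the sum of all entries) and two simple quadratic ones, and checks that they are unchanged when rows and columns are permuted together. The function names, the matrix size `D`, the scale `sigma`, and the ordering of the invariants are assumptions made for this example; they do not reproduce the paper's $I_1, I_2, \dots$ labelling.

```python
import numpy as np

def sample_invariants(M):
    """A few low-degree permutation invariants of a square matrix M.

    Returns [sum_i M_ii, sum_ij M_ij, sum_ij M_ij^2, sum_ij M_ij M_ji].
    The ordering is illustrative only, not the paper's labelling.
    """
    M = np.asarray(M, dtype=float)
    return np.array([
        np.trace(M),        # linear: sum of diagonal entries
        M.sum(),            # linear: sum of all entries
        np.sum(M * M),      # quadratic: sum of squared entries
        np.sum(M * M.T),    # quadratic: sum of M_ij * M_ji
    ])

def permute(M, perm):
    """Apply the same permutation to rows and columns of M."""
    return M[np.ix_(perm, perm)]

def ensemble_means(weights):
    """Average each invariant over an ensemble of shape (n_samples, D, D)."""
    return np.mean([sample_invariants(W) for W in weights], axis=0)

# Hypothetical ensemble: i.i.d. Gaussian entries, as in a simple initialisation.
rng = np.random.default_rng(0)
D, sigma = 10, 0.1
ensemble = rng.normal(0.0, sigma, size=(1000, D, D))

# Invariance check on one sample under a random simultaneous permutation.
perm = rng.permutation(D)
assert np.allclose(sample_invariants(ensemble[0]),
                   sample_invariants(permute(ensemble[0], perm)))

print(ensemble_means(ensemble))
```

Ensemble averages of invariants such as these are the kind of observable that can be compared between trained weight matrices and a fitted Gaussian model.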

Abstract

Ensembles of neural network weight matrices are studied through the training process for the MNIST classification problem, testing the efficacy of matrix models for representing their distributions, under assumptions of Gaussianity and permutation symmetry. The general 13-parameter permutation-invariant Gaussian matrix models are found to be effective models for the correlated Gaussianity in the weight matrices, beyond the range of applicability of the simple Gaussian with independent identically distributed matrix variables, and notably well beyond the initialisation step. The representation-theoretic model parameters and the graph-theoretic characterisation of the permutation-invariant matrix observables give an interpretable framework for the best-fit model and for small departures from Gaussianity. Additionally, the Wasserstein distance is calculated for this class of models and used to quantify the movement of the distributions over training. Throughout the work, the effects of varied initialisation regimes, regularisation, layer depth, and layer width are tested for this formalism, identifying limits where particular departures from Gaussianity are enhanced and how more general, yet still highly interpretable, models can be developed.
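For Gaussian distributions the Wasserstein distance has a closed form, which is what makes it practical for tracking a fitted model over training. The sketch below assumes the 2-Wasserstein distance, $W_2^2 = \lVert \mu_1 - \mu_2 \rVert^2 + \mathrm{Tr}\big(\Sigma_1 + \Sigma_2 - 2(\Sigma_1^{1/2} \Sigma_2 \Sigma_1^{1/2})^{1/2}\big)$; this section does not state which variant or parameterisation the paper uses, and the function names here are hypothetical. In this setting the mean and covariance would be those of the flattened matrix entries.

```python
import numpy as np

def psd_sqrt(S):
    """Symmetric square root of a symmetric positive semi-definite matrix."""
    S = 0.5 * (S + S.T)                  # symmetrise against round-off
    vals, vecs = np.linalg.eigh(S)
    vals = np.clip(vals, 0.0, None)      # remove tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T

def gaussian_w2(mu1, S1, mu2, S2):
    """2-Wasserstein distance between N(mu1, S1) and N(mu2, S2)."""
    mu1, mu2 = np.asarray(mu1, float), np.asarray(mu2, float)
    S1, S2 = np.asarray(S1, float), np.asarray(S2, float)
    root1 = psd_sqrt(S1)
    cross = psd_sqrt(root1 @ S2 @ root1)
    squared = np.sum((mu1 - mu2) ** 2) + np.trace(S1 + S2 - 2.0 * cross)
    return float(np.sqrt(max(squared, 0.0)))

# Two isotropic Gaussians in two dimensions with different scales.
d = gaussian_w2(np.zeros(2), np.eye(2), np.zeros(2), 4.0 * np.eye(2))
print(d)
```

Evaluating such a distance between the models fitted at successive epochs gives a single number for how far the distribution has moved.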

Paper Structure

This paper contains 35 sections, 116 equations, 15 figures, 12 tables.

Figures (15)

  • Figure 2.1: A diagrammatic representation of a general neural network with three hidden layers. This matches the main architecture used in this work; the numbers below each layer give its number of neurons. Each neuron applies a linear and then a non-linear function to its input vector; in this work the linear action is multiplication by a weight matrix (no biases), represented by the arrows in the diagram Armstrong-Williams:2024nzy. The connections between adjacent layers form a complete bipartite graph.
  • Figure 2.2: Sample images from the MNIST database of handwritten digits. Each row shows examples from the database of the digits $0 - 9$.
  • Figure 4.1: Variation of the linear invariant deviations (solid lines) and quadratic invariant deviations (dashed lines), labelled respectively by their invariants $I_1-I_{13}$, across the 50 epochs of training. The legend is the same throughout and is shown once at the bottom (g) for readability. Note that the y-axis scales vary.
  • Figure 4.2: Differences in the deviations of the low-node cubic (light blue), high-node cubic (dark blue), low-node quartic (light orange), and high-node quartic (dark orange) invariants $I_{14}-I_{52}$, between the end of training and prior to training, normalised relative to the sum of the values after and prior to training; displayed for all initialisations, and layers (L#) considered.
  • Figure 4.3: Variation of the cubic deviations (solid lines) and quartic deviations (dashed lines), labelled respectively by their invariants $I_{14}-I_{52}$, across the 50 epochs of training. The legend is the same throughout and is shown once at the bottom (g) for readability. Note that the y-axis scales are fixed within each initialisation but differ between initialisations.
  • ...and 10 more figures