Information Geometry of Evolution of Neural Network Parameters While Training

Abhiram Anand Thiruthummal, Eun-jin Kim, Sergiy Shelyag

TL;DR

This work applies information geometry to neural network training by treating the time-evolving probability density functions (PDFs) of the network parameters as points on a Riemannian manifold equipped with the Fisher information metric. It defines an information length $\mathcal{L}$ and an information velocity $\Gamma$ to quantify the trajectory of the parameter distributions during training, and shows that phase-transition-like changes in these quantities correlate with overfitting on MNIST, Fashion-MNIST, and CIFAR-10, including for deep architectures such as ResNet-50. Regularized-derivative techniques extract robust derivatives of $\mathcal{L}$, revealing minima in $d^2(\log \mathcal{L})/d(\log t)^2$ that align with, or precede, overfitting signals without using test data. The paper draws an analogy to thermodynamic phase transitions, discusses finite-size scaling behavior, and presents the approach as a potential tool for understanding and forecasting overfitting in neural networks. Overall, the method aims to make training dynamics more interpretable and offers an indicator of overfitting that requires no test data.

Abstract

Artificial neural networks (ANNs) are powerful tools capable of approximating arbitrary mathematical functions, but their interpretability remains limited, rendering them black-box models. To address this issue, numerous methods have been proposed to enhance the explainability and interpretability of ANNs. In this study, we apply an information geometric framework to investigate phase-transition-like behavior during the training of ANNs and relate these transitions to overfitting in certain models. The evolution of an ANN during training is studied by looking at the probability distribution of its parameters. Information geometry, which draws on the principles of differential geometry, offers a unique perspective on probability and statistics by considering probability density functions as points on a Riemannian manifold. We construct this manifold using a metric based on Fisher information to define a distance and a velocity. By parameterizing this distance and velocity with training steps, we study how the ANN evolves as training progresses. Using standard datasets such as MNIST, FMNIST and CIFAR-10, we observe a transition in the motion on the manifold while training the ANN, and this transition is identified with overfitting in the ANN models considered. The information geometric transitions observed are shown to be mathematically similar to phase transitions in physics. Preliminary results showing finite-size scaling behavior are also provided. This work contributes to the development of robust tools for improving the explainability and interpretability of ANNs, aiding our understanding of the variability that the parameters of these complex models exhibit during training.
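
The distance and velocity mentioned in the abstract can be illustrated with a small numerical sketch. It assumes the definitions commonly used in the information-length literature, $\Gamma^2(t) = \int dx\, \frac{1}{p(x,t)} \left(\frac{\partial p(x,t)}{\partial t}\right)^2$ and $\mathcal{L}(t) = \int_0^t \Gamma(t')\, dt'$; the paper's own equations are not reproduced on this page, so the exact form may differ. The function name `information_length`, the histogram density estimate, the bin count, and the unit time step are all illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def information_length(snapshots, bins=50, eps=1e-12):
    """Estimate information velocity and length from parameter snapshots.

    snapshots: array of shape (T, N) holding N parameter values
    recorded at each of T training steps (hypothetical input layout).
    """
    snapshots = np.asarray(snapshots, dtype=float)
    # Shared bin edges so histograms at different steps are comparable.
    edges = np.linspace(snapshots.min(), snapshots.max(), bins + 1)
    dx = edges[1] - edges[0]
    # p[t, i]: estimated probability density in bin i at step t.
    p = np.stack([np.histogram(s, bins=edges, density=True)[0]
                  for s in snapshots])
    # Forward difference in time, assuming unit spacing between steps.
    dp_dt = np.diff(p, axis=0)
    p_mid = 0.5 * (p[1:] + p[:-1])
    # Gamma^2 = integral over x of (dp/dt)^2 / p; empty bins contribute zero.
    integrand = np.where(p_mid > eps, dp_dt**2 / np.maximum(p_mid, eps), 0.0)
    gamma = np.sqrt(integrand.sum(axis=1) * dx)
    # L(t) is the cumulative time integral of Gamma, starting from zero.
    length = np.concatenate([[0.0], np.cumsum(gamma)])
    return gamma, length
```

In this sketch `gamma` has one entry per pair of consecutive steps and `length` has one entry per step. A histogram estimate is noisy for small parameter counts, so the bin count would need tuning in practice.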

Paper Structure (16 sections, 11 equations, 22 figures, 1 table)

Figures (22)

  • Figure 1: Neural architecture used in this work for the MNIST and Fashion-MNIST classification tasks.
  • Figure 2: An example of (left) information velocity and (right) information length estimates for the network described in Fig. 1, trained on MNIST data using the SGD optimizer with learning rate 0.045.
  • Figure 3: $\Gamma$ calculated from the $\mathcal{L}$ shown in Fig. 2 using a regularized derivative. Here $\lambda = 100$.
  • Figure 4: Information length for different optimizers on the MNIST dataset. The activation function used was Leaky ReLU and the learning rate was 0.013.
  • Figure 5: Information length for (left) SGD and (right) Adadelta optimizers with different activation functions, trained on the MNIST dataset.
  • ...and 17 more figures
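
The regularized derivative referred to in Figure 3, and the quantity $d^2(\log \mathcal{L})/d(\log t)^2$ from the TL;DR, can be illustrated with a rough sketch. The paper's regularization scheme is not specified on this page, so the code below substitutes a Tikhonov (Whittaker) smoother followed by finite differences; this is an assumed stand-in, not the authors' method. The name `loglog_curvature` and the parameter `lam` are hypothetical, and the default `lam=100.0` only echoes the $\lambda = 100$ quoted in Figure 3 without implying the two regularizers are equivalent.

```python
import numpy as np

def loglog_curvature(t, length, lam=100.0):
    """Second derivative of log(length) with respect to log(t).

    The curve is smoothed first because finite differences amplify noise.
    Both t and length must be strictly positive.
    """
    x = np.log(np.asarray(t, dtype=float))
    y = np.log(np.asarray(length, dtype=float))
    n = y.size
    # Second-difference operator of shape (n - 2, n), acting on sample index.
    D = np.diff(np.eye(n), n=2, axis=0)
    # Whittaker smoothing: minimise |u - y|^2 + lam * |D u|^2.
    u = np.linalg.solve(np.eye(n) + lam * D.T @ D, y)
    # np.gradient accepts the (possibly non-uniform) coordinates x.
    slope = np.gradient(u, x)
    return np.gradient(slope, x)
```

Because the roughness penalty acts on the sample index rather than on $\log t$, the sketch behaves best when the training steps are sampled roughly geometrically. A candidate transition point would then be read off as `t[np.argmin(loglog_curvature(t, length))]`.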