Collective variables of neural networks: empirical time evolution and scaling laws

Samuel Tovey, Sven Krippendorf, Michael Spannowsky, Konstantin Nikolaou, Christian Holm

TL;DR

The authors argue that an increase in the entropy of the neural tangent kernel spectrum during training, which they call structure formation, should be considered the onset of a deep learning regime, owing to its ubiquity in deep neural network architectures and its flexibility in creating feature-rich representations.

Abstract

This work presents a novel means for understanding learning dynamics and scaling relations in neural networks. We show that certain measures on the spectrum of the empirical neural tangent kernel (NTK), specifically entropy and trace, yield insight into the representations learned by a neural network and how these can be improved through architecture scaling. These results are demonstrated first on test cases before being shown on more complex networks, including transformers, auto-encoders, graph neural networks, and reinforcement learning studies. In testing on a wide range of architectures, we highlight the universal nature of training dynamics and further discuss how it can be used to understand the mechanisms behind learning in neural networks. We identify two such dominant mechanisms present throughout machine learning training. The first, information compression, is seen through a reduction in the entropy of the NTK spectrum during training, and occurs predominantly in small neural networks. The second, termed structure formation, is seen through an increasing entropy and, thus, the creation of structure in the neural network representations beyond the prior established by the network at initialization. Due to the ubiquity of the latter in deep neural network architectures and its flexibility in the creation of feature-rich representations, we argue that this form of evolution of the network's entropy be considered the onset of a deep learning regime.
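To make the two collective variables named in the abstract concrete, the following is a minimal sketch of how the trace and entropy of an empirical NTK spectrum might be computed. It rests on several assumptions that are not taken from this page: the entropy is assumed to be the Shannon-type entropy of the eigenvalues after normalizing them to sum to one, the toy model $f(x) = v \cdot \tanh(Wx)$ and its sizes are invented for illustration, and the function names `empirical_ntk` and `ntk_trace_and_entropy` are hypothetical. The paper body should be consulted for the authors' exact definitions.

```python
import numpy as np


def empirical_ntk(W, v, X):
    """Empirical NTK Gram matrix J @ J.T for the toy model f(x) = v . tanh(W x).

    The toy model is an illustrative assumption, not the paper's architecture.
    W has shape (h, d), v has shape (h,), X has shape (n, d).
    """
    H = np.tanh(X @ W.T)                                   # hidden activations, (n, h)
    dv = H                                                 # df/dv, (n, h)
    dW = (v * (1.0 - H**2))[:, :, None] * X[:, None, :]    # df/dW, (n, h, d)
    J = np.concatenate([dv, dW.reshape(len(X), -1)], axis=1)  # Jacobian, (n, h + h*d)
    return J @ J.T                                         # (n, n), symmetric PSD


def ntk_trace_and_entropy(ntk):
    """Return (trace, entropy) of the NTK spectrum.

    Assumed definition: entropy = -sum_i p_i log p_i, where p_i are the
    eigenvalues normalized to sum to one.
    """
    eig = np.linalg.eigvalsh(ntk)        # real eigenvalues of a symmetric matrix
    eig = np.clip(eig, 0.0, None)        # remove tiny negative round-off values
    trace = eig.sum()
    p = eig / trace
    p = p[p > 0]                         # treat 0 * log(0) as 0
    entropy = -np.sum(p * np.log(p))
    return float(trace), float(entropy)


# Hypothetical sizes: 8 data points, 3 input features, 16 hidden units.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
W = rng.normal(size=(16, 3)) / np.sqrt(3)
v = rng.normal(size=16) / np.sqrt(16)

ntk = empirical_ntk(W, v, X)
trace, entropy = ntk_trace_and_entropy(ntk)
```

Under this assumed definition, the entropy lies between 0 (one dominant eigenvalue) and $\log n$ for $n$ data points (a flat spectrum). Recomputing both quantities on a fixed batch at intervals during training would give time-evolution curves of the kind the abstract describes, with falling entropy read as information compression and rising entropy as structure formation.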

Paper Structure

This paper contains 21 sections, 11 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Time evolution diagrams for the convolutional MNIST study. The top row shows the evolution difference due to a changing activation function. The second row is width scaling, and the third is depth scaling. The tuple over the plots highlights the architecture being studied. It is structured as (Width, Depth, Activation), where an $x$ indicates that this property is being changed during the study.
  • Figure 2: Architecture scaling sweep of a dense neural network trained on the MNIST dataset. The top row (a) shows the entropy, trace and loss as a function of network width and depth at initialization. The bottom row (b) shows the same data after training the model.
  • Figure 3: Collective variable evolution for the novelty architectures. In each case, the raw value, along with a running average, is shown. (a) A ResNet-18 trained on the CIFAR-10 dataset. (b) A transformer trained on part-of-speech tagging of ancient Greek. (c) A CNN-based reinforcement learner trained on the arcade game Pong. Note the use of reward instead of loss on the y-axis. (d) A graph neural network trained on molecular property prediction. (e) A simple variational auto-encoder trained to reproduce the MNIST dataset.
  • Figure 4: Time evolution diagrams for the dense MNIST study. The top row shows the evolution difference due to a changing activation function. The second row is width scaling, and the third is depth scaling.
  • Figure 5: Time evolution diagrams for the dense MPG regression dataset study. The top row shows the evolution difference due to a changing activation function. The second row is width scaling, and the third is depth scaling.