Mean Field Limit of the Learning Dynamics of Multilayer Neural Networks

Phan-Minh Nguyen

TL;DR

The paper develops a mean-field (MF) limit for multilayer neural networks trained with SGD, showing that as the width grows, the learning dynamics converge to a nontrivial, width-independent regime. It introduces an MF formalism built on marginal uniformity and self-averaging, in which intermediate-layer neurons are represented by stochastic kernels, of which only the conditional expectations are used, and derives forward, backward, and evolution equations linking finite networks to deterministic MF dynamics. Theoretical results on statics, together with experiments on isotropic Gaussian tasks, MNIST, CIFAR-10, and CNN-like architectures, validate the MF predictions, demonstrating width-invariant yet nontrivial learning behavior. This work extends established two-layer MF limits to deep architectures, offering a principled framework for analyzing and potentially guiding the design of wide multilayer networks.

Abstract

Can multilayer neural networks -- typically constructed as highly complex structures with many nonlinearly activated neurons across layers -- behave in a non-trivial way that yet simplifies away a major part of their complexities? In this work, we uncover a phenomenon in which the behavior of these complex networks -- under suitable scalings and stochastic gradient descent dynamics -- becomes independent of the number of neurons as this number grows sufficiently large. We develop a formalism in which this many-neurons limiting behavior is captured by a set of equations, thereby exposing a previously unknown operating regime of these networks. While the current pursuit is mathematically non-rigorous, it is complemented with several experiments that validate the existence of this behavior.

Paper Structure

This paper contains 33 sections, 102 equations, 12 figures, 1 table.

Figures (12)

  • Figure 1: The performance of five $4$-layer fully-connected networks on MNIST classification, plotted against training iterations. The five networks have 100, 200, 400, 800 and 1600 neurons at each hidden layer, respectively. Details are available in Section \ref{sec:Validation}.
  • Figure 2: (a): A graphical representation of a two-layer network, as in Eq. (\ref{eq:two-layers-net}). (b): An equivalent representation for $\sigma\left(\boldsymbol{x};\boldsymbol{\theta}_{i}\right)=\beta_{i}\varphi\left(\left\langle \boldsymbol{w}_{i},\boldsymbol{x}\right\rangle +b_{i}\right)$.
  • Figure 3: (a): A graphical representation of a three-layer network. (b): An equivalent representation, as proposed in Section \ref{subsec:three-layers-derivation}. Neuron $j$ of the second layer is represented by $\nu_{j}$, and $f_{j}={\rm CE}\left\{ \nu_{j}\right\}$. Notice that neuron $j$ of the second layer receives the forward pass information averaged over all neurons of the first layer. Likewise, neuron $i$ of the first layer receives the backward pass information averaged over all neurons of the second layer. However, neuron $j$ of the third layer does not average its received forward pass information over all neurons of the second layer, due to its connectivity. Likewise, neuron $j$ of the second layer does not average its received backward pass information over all neurons of the third layer.
  • Figure 4: A graphical representation of a multilayer neural network, with $L+1$ fully-connected layers. Here neuron $j$ at layer $\ell>1$ is represented by $\left(\boldsymbol{\theta}_{\ell,j},f_{\ell,j}\right)$ to be consistent with the information presented in Section \ref{sec:MF-Multilayer}, while we note that the actual representation is $\left(\boldsymbol{\theta}_{\ell,j},\nu_{\ell,j}\right)$ for some stochastic kernel $\nu_{\ell,j}$ and $f_{\ell,j}={\rm CE}\left\{ \nu_{\ell,j}\right\}$, as per the derivation in Appendix \ref{sec:multilayer-derivation}.
  • Figure 5: The performance of five $5$-layer fully-connected networks on isotropic Gaussians classification, plotted against training iteration. Here the five networks have $n=50,100,200,400,800$ respectively, $\sigma$ is the ReLU, and the learning rate is $\alpha=0.001$. Top row: we initialize with $\tau_{1}=\sqrt{2}$, $\mu_{2}=1$, $\tau_{2}=0.1$, $\mu_{3}=0$ and $\tau_{3}=0$. Bottom row: aside from the same initialization (solid lines), we perform another initialization that differs by $\tau_{2}=3$ (dotted lines).
  • ...and 7 more figures