Proportional infinite-width infinite-depth limit for deep linear neural networks

Federico Bassetti, Lucia Ladelli, Pietro Rotondo

TL;DR

For deep linear networks in the proportional limit, where depth and width diverge at a fixed ratio, this work rigorously characterizes the limiting output distribution as a nontrivial mixture of Gaussians: a non-Gaussian law that retains correlations between outputs.

Abstract

We study the distributional properties of linear neural networks with random parameters in the context of large networks, where the number of layers diverges in proportion to the number of neurons per layer. Prior works have shown that in the infinite-width regime, where the number of neurons per layer grows to infinity while the depth remains fixed, neural networks converge to a Gaussian process, known as the Neural Network Gaussian Process. However, this Gaussian limit sacrifices descriptive power, as it lacks the ability to learn dependent features and produce output correlations that reflect observed labels. Motivated by these limitations, we explore the joint proportional limit in which both depth and width diverge but maintain a constant ratio, yielding a non-Gaussian distribution that retains correlations between outputs. Our contribution extends previous works by rigorously characterizing, for linear activation functions, the limiting distribution as a nontrivial mixture of Gaussians.
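The contrast between a small and a larger depth-to-width ratio can be probed numerically. The following is a minimal Monte Carlo sketch, not the paper's construction: it assumes i.i.d. standard Gaussian weights, a $1/\sqrt{\text{width}}$ rescaling at every layer, equal widths across layers, a single fixed input, and a scalar output. The names `sample_outputs` and `excess_kurtosis` are hypothetical helpers introduced for illustration. Excess kurtosis is zero for a Gaussian, so a clearly positive value is a rough sign of non-Gaussian outputs.

```python
import numpy as np


def sample_outputs(width, depth, n_samples, rng):
    """Draw scalar outputs of a random deep linear network (illustrative setup).

    Every sample uses fresh, independent weights. Assumptions made here:
    standard Gaussian entries, 1/sqrt(width) scaling, and a fixed all-ones input.
    """
    # Fixed input whose squared norm equals the width.
    h = np.ones((n_samples, width))
    for _ in range(depth):
        # One independent width x width weight matrix per sample.
        W = rng.standard_normal((n_samples, width, width))
        h = np.einsum("nij,nj->ni", W, h) / np.sqrt(width)
    # Linear readout to a single output unit.
    w_out = rng.standard_normal((n_samples, width))
    return np.einsum("ni,ni->n", w_out, h) / np.sqrt(width)


def excess_kurtosis(samples):
    """Sample excess kurtosis: fourth standardized moment minus 3."""
    centered = samples - samples.mean()
    m2 = np.mean(centered**2)
    m4 = np.mean(centered**4)
    return m4 / m2**2 - 3.0


rng = np.random.default_rng(0)
n_samples = 40_000

# Same width, two depth-to-width ratios: 1/16 versus 4/16.
shallow = sample_outputs(width=16, depth=1, n_samples=n_samples, rng=rng)
deep = sample_outputs(width=16, depth=4, n_samples=n_samples, rng=rng)

shallow_kurt = excess_kurtosis(shallow)
deep_kurt = excess_kurtosis(deep)
print(f"excess kurtosis, depth/width = 1/16: {shallow_kurt:.3f}")
print(f"excess kurtosis, depth/width = 4/16: {deep_kurt:.3f}")
```

In this toy setting the output is Gaussian conditionally on the last hidden layer, with a random variance, so it is a scale mixture of Gaussians; the tails get heavier as the depth-to-width ratio grows. Fourth-moment estimates are noisy for heavy-tailed samples, so the printed values should be read qualitatively.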

Paper Structure

This paper contains 17 sections, 14 theorems, 134 equations.

Key Result

Proposition 2.1

Let $f_{L}(\mathbf{X}| \theta)$ be the outputs of a fully-connected linear network under the prior law of the weights $W$. If $\min(N_\ell:\ell=1,\dots,L) >D$ and $\lambda^*_L:=\lambda_0 \dots \lambda_L$, then ... where $V^{\ell}$ are any $D \times D$ independent random matrices such that $Q^\ell= V^{\ell} (V^{\ell})^\top$ has a Wishart distribution with $N_\ell$ degrees of freedom and scale matrix $\frac{1}{N_\ell} \dots$

Theorems & Definitions (21)

  • Proposition 2.1: BaRoetal
  • Proposition 3.1: the case $a=0$ and $D \geq 1$
  • Proposition 3.2: the case $a>0$ and $D \geq 1$
  • Proposition 3.3
  • Proposition 3.4
  • Remark 1
  • Lemma 4.1
  • proof
  • Lemma 4.2
  • proof
  • ...and 11 more