Les Houches Lectures on Deep Learning at Large & Infinite Width

Yasaman Bahri, Boris Hanin, Antonin Brossollet, Vittorio Erba, Christian Keup, Rosalba Pacelli, James B. Simon

TL;DR

The Les Houches lectures explore how deep neural networks behave in the large- and infinite-width limits, linking training dynamics to Gaussian processes and kernel methods via the Neural Tangent Kernel (NTK). The first part establishes Gaussian process (GP) priors and kernel recursions for various architectures, then shows how gradient-based training becomes kernel regression with a fixed NTK in the infinite-width limit. The subsequent lectures develop finite-width corrections through a BBGKY-like function-space hierarchy, revealing how the depth-to-width ratio L/n controls non-Gaussian fluctuations and feature learning, including the catapult phenomenon at large learning rates. A complementary line of work focuses on exact finite-width statistics for ReLU networks at a single input, predicting log-normal Jacobians and exponential NTK fluctuations governed by an inverse temperature β ≈ 5L/n. Collectively, the lectures provide a coherent framework connecting initialization, training dynamics, and finite-width effects, with implications for hyperparameter tuning and for understanding when neural networks behave as kernel methods versus learning rich representations.
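To make the phrase "kernel recursions" concrete, here is a minimal sketch of the infinite-width GP kernel recursion for a fully connected ReLU network. It assumes the standard convention $K^{(1)}(x,x') = \sigma_b^2 + \sigma_w^2\, x\cdot x'/n_0$ and $K^{(\ell+1)}(x,x') = \sigma_b^2 + \sigma_w^2\, \mathbb{E}[\phi(u)\phi(v)]$ with $(u,v)$ Gaussian with covariance given by $K^{(\ell)}$, together with the well-known closed form of that expectation for ReLU. The function names, default values, and the exact normalization are illustrative assumptions and may differ from the notation used in the lectures.

```python
import math

def relu_expectation(k11, k12, k22):
    """E[relu(u) * relu(v)] for (u, v) ~ N(0, [[k11, k12], [k12, k22]])."""
    norm = math.sqrt(k11 * k22)
    # Clamp to guard against round-off pushing the cosine outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, k12 / norm))
    theta = math.acos(cos_theta)
    return norm * (math.sin(theta) + (math.pi - theta) * cos_theta) / (2.0 * math.pi)

def nngp_relu(x1, x2, num_layers, sigma_w2=2.0, sigma_b2=0.0):
    """Return (K(x1,x1), K(x1,x2), K(x2,x2)) after `num_layers` ReLU layers.

    Assumed convention (not taken from the source): the first affine layer
    gives K = sigma_b2 + sigma_w2 * <x, x'> / n0, and each ReLU layer maps
    K -> sigma_b2 + sigma_w2 * E[relu(u) relu(v)].
    """
    n0 = len(x1)
    dot = lambda a, b: sum(p * q for p, q in zip(a, b))
    k11 = sigma_b2 + sigma_w2 * dot(x1, x1) / n0
    k12 = sigma_b2 + sigma_w2 * dot(x1, x2) / n0
    k22 = sigma_b2 + sigma_w2 * dot(x2, x2) / n0
    for _ in range(num_layers):
        new11 = sigma_b2 + sigma_w2 * relu_expectation(k11, k11, k11)
        new12 = sigma_b2 + sigma_w2 * relu_expectation(k11, k12, k22)
        new22 = sigma_b2 + sigma_w2 * relu_expectation(k22, k22, k22)
        k11, k12, k22 = new11, new12, new22
    return k11, k12, k22

if __name__ == "__main__":
    for depth in (1, 5, 20):
        k11, k12, k22 = nngp_relu([1.0, 0.0], [0.0, 1.0], depth)
        print(depth, k11, k12 / math.sqrt(k11 * k22))
```

With $\sigma_w^2 = 2$ and $\sigma_b^2 = 0$ the diagonal entries are preserved from layer to layer in this sketch, while the correlation between two distinct inputs grows with depth.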

Abstract

These lectures, presented at the 2022 Les Houches Summer School on Statistical Physics and Machine Learning, focus on the infinite-width limit and large-width regime of deep neural networks. Topics covered include various statistical and dynamical properties of these networks. In particular, the lecturers discuss properties of random deep neural networks; connections between trained deep neural networks, linear models, kernels, and Gaussian processes that arise in the infinite-width limit; and perturbative and non-perturbative treatments of large but finite-width networks, at initialization and after training.

Paper Structure

This paper contains 58 sections, 6 theorems, 215 equations, and 2 figures.

Key Result

Theorem 4.1 (GP + NTK Regime for Networks at Fixed Depth and Infinite Width)

Fix $L, n_0, n_{L+1}, \sigma$. Suppose that at the start of training we initialize as in Eq. (E:init).

Figures (2)

  • Figure 1: Phase diagram in the $(\sigma_b^2, \sigma_w^2)$ plane for fixed points of the NNGP recursion relation with nonlinearity $\phi = \tanh$, showing ordered and chaotic phases separated by a critical line. Figure reproduced from bahri2020; see also schoenholz2017.
  • Figure 2: Partial Phase Diagram for Fully Connected Networks with NTK Initialization
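
The ordered and chaotic phases named in the caption of Figure 1 can be illustrated with a short numerical sketch. It assumes the usual mean-field criterion: iterate the variance map $q \mapsto \sigma_b^2 + \sigma_w^2\,\mathbb{E}[\tanh^2(\sqrt{q}\,z)]$ to its fixed point $q^*$, then evaluate $\chi = \sigma_w^2\,\mathbb{E}[\phi'(\sqrt{q^*}\,z)^2]$; $\chi < 1$ is labeled ordered and $\chi > 1$ chaotic. The function names, the quadrature scheme, and the iteration count are illustrative choices, not taken from the lectures.

```python
import math

def gauss_mean(f, num=2001, zmax=8.0):
    """Approximate E[f(z)] for z ~ N(0, 1) with the trapezoidal rule on [-zmax, zmax]."""
    h = 2.0 * zmax / (num - 1)
    total = 0.0
    for i in range(num):
        z = -zmax + i * h
        weight = 0.5 if i in (0, num - 1) else 1.0
        total += weight * f(z) * math.exp(-0.5 * z * z)
    return total * h / math.sqrt(2.0 * math.pi)

def fixed_point_variance(sigma_w2, sigma_b2, iters=200, q0=1.0):
    """Iterate q -> sigma_b2 + sigma_w2 * E[tanh(sqrt(q) z)^2] for a fixed number of steps."""
    q = q0
    for _ in range(iters):
        s = math.sqrt(q)
        q = sigma_b2 + sigma_w2 * gauss_mean(lambda z: math.tanh(s * z) ** 2)
    return q

def chi(sigma_w2, sigma_b2):
    """chi = sigma_w2 * E[tanh'(sqrt(q*) z)^2], using tanh'(x) = 1 - tanh(x)^2."""
    s = math.sqrt(fixed_point_variance(sigma_w2, sigma_b2))
    return sigma_w2 * gauss_mean(lambda z: (1.0 - math.tanh(s * z) ** 2) ** 2)

def phase(sigma_w2, sigma_b2):
    """Label a point of the (sigma_b^2, sigma_w^2) plane under the assumed criterion."""
    return "chaotic" if chi(sigma_w2, sigma_b2) > 1.0 else "ordered"

if __name__ == "__main__":
    for sw2, sb2 in [(0.5, 0.1), (4.0, 0.0)]:
        print(sw2, sb2, chi(sw2, sb2), phase(sw2, sb2))
```

Under this criterion the critical line is the set of points where $\chi = 1$; scanning a grid of $(\sigma_b^2, \sigma_w^2)$ values with `phase` gives a coarse picture of the two regions.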

Theorems & Definitions (11)

  • Definition 1: Gaussian process
  • Proof (informal)
  • Theorem 4.1: GP + NTK Regime for Networks at Fixed Depth and Infinite Width
  • Theorem 4.2
  • Proposition 4.3
  • Lemma 4.4
  • Proof
  • Theorem 5.1: Meta-Claim
  • Proposition 5.2: Exact Matrix Model Underlying Random ReLU Networks
  • Proof (sketch)
  • ...and 1 more