Large deviations of one-hidden-layer neural networks

Christian Hirsch, Daniel Willhalm

TL;DR

This work establishes a large-deviation framework for the weight evolution of one-hidden-layer neural networks trained by stochastic gradient descent (SGD), as the hidden-layer size and the number of training iterations grow simultaneously. It develops a quenched large deviation principle (LDP) for the empirical weight trajectory, conditioned on the initial weight measure, and derives an annealed LDP by tilting the initial data, with rate functions expressed via relative entropy and a tilted data-evolution cost. The limiting dynamics are characterized by a McKean-Vlasov-type measure-valued evolution that reduces to a weak solution of a deterministic integral equation, and the rate function is shown to be good under suitable moment and support conditions. Further outcomes are a weak law of large numbers (LLN) and conditions ensuring uniqueness of the limiting trajectory, which may enable applications to rare-event analysis and importance sampling for SGD in shallow networks.

Abstract

We study large deviations in the context of stochastic gradient descent for one-hidden-layer neural networks with quadratic loss. We derive a quenched large deviation principle, where we condition on an initial weight measure, and an annealed large deviation principle for the empirical weight evolution during training when letting the number of neurons and the number of training iterations simultaneously tend to infinity. The weight evolution is treated as an interacting dynamic particle system. The distinctive aspect compared to prior work on interacting particle systems lies in the discrete particle updates, simultaneously with a growing number of particles.
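To make the particle-system view concrete, the following is a minimal sketch of SGD on a one-hidden-layer network with quadratic loss, in which each neuron is a particle $(c_i, w_i)$ and the network is determined by the empirical measure of these particles. Everything specific here is an assumption for illustration and is not taken from the paper: the $1/n$ mean-field scaling of the output and of the step size, the `tanh` activation, the scalar synthetic data, the initial weight law, and the function names `sgd_weight_trajectory` and `empirical_mean`.

```python
import math
import random


def sgd_weight_trajectory(n, horizon=1.0, alpha=0.5, seed=0):
    """Run SGD for a one-hidden-layer network with n neurons and quadratic loss.

    Returns a list of snapshots. Each snapshot is the list of particles
    (c_i, w_i); their empirical measure describes the network at that step.
    """
    rng = random.Random(seed)
    # Initial weights drawn i.i.d. (the initial law is an assumption).
    c = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    w = [rng.gauss(0.0, 1.0) for _ in range(n)]
    # The number of iterations grows together with the number of neurons.
    steps = int(n * horizon)
    snapshots = [list(zip(c, w))]
    for _ in range(steps):
        x = rng.uniform(-1.0, 1.0)   # synthetic input (assumption)
        y = math.sin(math.pi * x)    # synthetic target (assumption)
        act = [math.tanh(wi * x) for wi in w]
        # Network output: average over particles (mean-field scaling, assumed).
        pred = sum(ci * ai for ci, ai in zip(c, act)) / n
        err = y - pred
        # Discrete particle updates: every particle sees the same error term,
        # so particles interact only through the empirical measure.
        new_c = [ci + (alpha / n) * err * ai for ci, ai in zip(c, act)]
        new_w = [
            wi + (alpha / n) * err * ci * (1.0 - ai * ai) * x
            for ci, wi, ai in zip(c, w, act)
        ]
        c, w = new_c, new_w
        snapshots.append(list(zip(c, w)))
    return snapshots


def empirical_mean(snapshot, f):
    """Integrate a test function f(c, w) against the empirical weight measure."""
    return sum(f(ci, wi) for ci, wi in snapshot) / len(snapshot)


if __name__ == "__main__":
    traj = sgd_weight_trajectory(n=200)
    # Network output at x = 0.5 after training, read off the empirical measure.
    print(empirical_mean(traj[-1], lambda c, w: c * math.tanh(0.5 * w)))
```

Storing every snapshot costs memory of order $n^2$, which is acceptable for a small illustration; a longer run would keep only selected time points.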

Paper Structure (18 sections, 23 theorems, 152 equations, 1 figure)


Key Result

Theorem 2

Assume that CON, DCOMP, UNQ and WCOMP are satisfied. Then the family of empirical measures $(\theta^n)_{n \geqslant 1}$ satisfies the LDP in $\mathcal{P}(\mathcal{X})$ with respect to the weak topology, with a rate function defined for $\eta\in\mathcal{P}(\mathcal{X})$.
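For orientation, the generic definition of an LDP can be written as follows. This is the standard textbook formulation, not the paper's specific statement: the symbols $I$ for the rate function and $a_n$ for the speed are placeholders chosen here, and the paper's actual speed and rate function are not reproduced on this page. A family $(\theta^n)_{n \geqslant 1}$ satisfies the LDP with speed $a_n \to \infty$ and lower semicontinuous rate function $I \colon \mathcal{P}(\mathcal{X}) \to [0, \infty]$ if, for every measurable set $A \subseteq \mathcal{P}(\mathcal{X})$,

$$
-\inf_{\eta \in \operatorname{int} A} I(\eta)
\;\leqslant\; \liminf_{n \to \infty} \frac{1}{a_n} \log \mathbb{P}(\theta^n \in A)
\;\leqslant\; \limsup_{n \to \infty} \frac{1}{a_n} \log \mathbb{P}(\theta^n \in A)
\;\leqslant\; -\inf_{\eta \in \overline{A}} I(\eta).
$$

The rate function is called good if its sublevel sets $\{\eta : I(\eta) \leqslant \alpha\}$ are compact for every $\alpha \geqslant 0$.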

Figures (1)

  • Figure 1: Architecture of the considered one-hidden-layer neural network

Theorems & Definitions (46)

  • Remark 1
  • Theorem 2: Annealed LDP for $\theta^n$
  • Theorem 3: Compact data support implies uniqueness
  • Corollary 4: Weak LLN for $\theta^n$
  • Theorem 5: Quenched LDP for $\eta^n$
  • Proposition 6: Representation formula
  • proof
  • Lemma 7: Tightness of $\tilde{\pi}^n$
  • Lemma 8: Tightness of $\bar{\eta}^n$
  • Lemma 9: Weak limit satisfies SDE
  • ...and 36 more