Uniform-in-time concentration in two-layer neural networks via transportation inequalities

Arnaud Guillin; Boris Nectoux; Paul Stos

TL;DR

This work quantifies the discrepancy between the predictions of a two-layer neural network trained by stochastic gradient descent and their mean-field limit, for quadratic loss and ridge regularization. It also proves uniform-in-time concentration of the empirical parameter measure around its mean-field limit in the Wasserstein distance $W_1$.

Abstract

We quantify, uniformly over time and with high probability, the discrepancy between the predictions of a two-layer neural network trained by stochastic gradient descent (SGD) and their mean-field limit, for quadratic loss and ridge regularization. As a key ingredient, we establish $T_p$ transportation inequalities ($p \in \{1, 2\}$) for the law of the SGD parameters, with explicit constants independent of the iteration index. We then prove uniform-in-time concentration of the empirical parameter measure around its mean-field limit in the Wasserstein distance $W_1$, and we translate these bounds into prediction-error estimates against a fixed test function $\Phi$. We also derive analogous concentration bounds in the sliced-Wasserstein distance $SW_1$, leading to dimension-free rates.
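To make the sliced-Wasserstein distance mentioned in the abstract concrete, here is a minimal sketch of a Monte Carlo estimate of $SW_1$ between two empirical measures. It is an illustration only, not the paper's construction: the function name `sliced_w1`, its parameters, and the sample data are hypothetical, the estimate averages one-dimensional $W_1$ distances over random unit directions, and the paper's normalization of $SW_1$ may differ. The sketch assumes both point clouds have the same number of points, so each one-dimensional $W_1$ reduces to the mean gap between sorted projections.

```python
import math
import random


def sliced_w1(xs, ys, n_proj=200, seed=0):
    """Monte Carlo estimate of SW_1 between two equal-size point clouds.

    Hypothetical helper for illustration; assumes len(xs) == len(ys).
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(n_proj):
        # Draw a uniform direction on the sphere by normalizing a Gaussian vector.
        g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in g))
        theta = [c / norm for c in g]
        # Project both clouds onto theta and sort the projections.
        px = sorted(sum(t * c for t, c in zip(theta, x)) for x in xs)
        py = sorted(sum(t * c for t, c in zip(theta, y)) for y in ys)
        # In 1D, W_1 between equal-size empirical measures is the mean
        # absolute gap between order statistics.
        total += sum(abs(a - b) for a, b in zip(px, py)) / len(px)
    return total / n_proj


# Illustrative data (an assumption): a Gaussian cloud and a shifted copy.
data_rng = random.Random(1)
xs = [[data_rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(100)]
shift = [0.5] * 5
ys = [[c + s for c, s in zip(x, shift)] for x in xs]

print(sliced_w1(xs, xs))  # identical clouds
print(sliced_w1(xs, ys))  # shifted clouds
```

Each projection costs only a sort, so the cost per direction does not grow with the parameter dimension beyond the inner products. This is a computational remark about the sketch, separate from the dimension-free concentration rates stated in the abstract.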

Paper Structure (17 sections, 13 theorems, 116 equations)

Introduction
Related works.
Notation.
Main results
Assumptions
Transportation inequalities for the SGD dynamics
Uniform bias decay and concentration around mean-field
Application to the network output.
Proofs
SGD dynamics
Uniform bias decay
Proof of Propositions \ref{prop:FG}-\ref{prop:FG2}
Proof of Proposition \ref{prop:PoC}
Extension to unbounded activations
Localization assumptions
...and 2 more sections

Key Result

Proposition 1

Fix $p\in\{1,2\}$. Assume \ref{assump:A1}--\ref{assump:A3}, \ref{assump:A4p}, and $L_N < 1$. Then for all $k\in\mathbb{N}$, $\nu_k \in T_p(C_N^{(p)})$ on $(\mathcal{E},\|\cdot\|_p)$, with the explicit constants

Theorems & Definitions (25)

Proposition 1
Remark
Corollary 1
Proposition 2
Remark
Theorem 1
Lemma 1
proof
proof: Proof of \ref{prop:Tp}
Proposition 3
...and 15 more
