Generalization of Scaled Deep ResNets in the Mean-Field Regime

Yihang Chen, Fanghui Liu, Yiping Lu, Grigorios G. Chrysos, Volkan Cevher

TL;DR

This paper studies scaled deep ResNets in the mean-field regime, where depth and width both go to infinity and the gradient flow is described by a PDE. It replaces the time-invariant Gram matrix of the lazy training regime with a time-varying, distribution-dependent one, establishes a global lower bound on that matrix's minimum eigenvalue under the mean-field dynamics, and bounds the KL divergence between the initial and trained parameter distributions. Combined with a Rademacher complexity argument, these results give linear convergence of the empirical training loss, bounded KL movement, and a uniform generalization bound of order $\mathcal{O}(1/\sqrt{n})$, which matches NTK-style bounds while holding beyond the lazy training regime. The analysis connects non-stationary kernel evolution with learning over parameter distributions, and it bears on feature learning in regimes where parameters move significantly during training.

Abstract

Despite the widespread empirical success of ResNet, the generalization properties of deep ResNets are rarely explored beyond the lazy training regime. In this work, we investigate scaled ResNets in the limit of infinitely deep and wide neural networks, whose gradient flow is described by a partial differential equation in the large-network limit, i.e., the mean-field regime. Deriving generalization bounds in this setting requires a shift from the conventional time-invariant Gram matrix employed in the lazy training regime to a time-variant, distribution-dependent version. To this end, we provide a global lower bound on the minimum eigenvalue of the Gram matrix under the mean-field regime. In addition, to keep the dynamics of the Kullback-Leibler (KL) divergence tractable, we establish linear convergence of the empirical error and derive an upper bound on the KL divergence over the parameter distributions. Finally, we establish uniform convergence for the generalization bound via Rademacher complexity. Our results offer new insights into the generalization ability of deep ResNets beyond the lazy training regime and contribute to advancing the understanding of the fundamental properties of deep neural networks.
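To make the notion of a distribution-dependent Gram matrix concrete, here is a minimal Python sketch. It is not the paper's construction: it assumes a finite-width ReLU random-feature kernel as a stand-in, and the names `min_gram_eigenvalue`, `W0`, and `W1` are hypothetical. It only illustrates that the minimum eigenvalue depends on the parameter distribution, so it has to be controlled along the whole training trajectory rather than only at initialization.

```python
import numpy as np

def min_gram_eigenvalue(X, W):
    """Smallest eigenvalue of G = Phi Phi^T / M, where
    Phi[i, m] = relu(<x_i, w_m>) is a finite-width random-feature map.
    (Assumed stand-in kernel, not the Gram matrix defined in the paper.)"""
    Phi = np.maximum(X @ W.T, 0.0)      # (n, M) ReLU features
    G = Phi @ Phi.T / W.shape[0]        # (n, n) empirical Gram matrix
    return np.linalg.eigvalsh(G)[0]     # eigvalsh returns eigenvalues in ascending order

rng = np.random.default_rng(0)
n, d, M = 20, 5, 2000
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm inputs

# Hypothetical samples from an "initial" and a "moved" parameter distribution.
W0 = rng.standard_normal((M, d))
W1 = W0 + 0.5 * rng.standard_normal((M, d))

lam0 = min_gram_eigenvalue(X, W0)
lam1 = min_gram_eigenvalue(X, W1)
print(lam0, lam1)
```

The two printed values generally differ, which is the reason a bound that holds only at initialization is not enough once parameters move.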

Paper Structure

This paper contains 30 sections, 27 theorems, 184 equations, and 1 figure.

Key Result

Theorem 4.1

The training dynamics of $\widehat{L}({\tau_t,\nu_t})$ can be written as:

Figures (1)

  • Figure 1: Left: the "Two Spirals" dataset. Right: $L_{0-1}$ test error vs. the training dataset size $n_{\rm train}$ (blue), with an OLS fitted line (red) that is close to the $\mathcal{O}(1/n)$ rate, with $p$-value $10^{-5}$.
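
A rate such as the one in Figure 1 can be estimated by an ordinary least squares fit of test error against training set size. The sketch below assumes the fit is done on a log-log scale, which the caption does not state, and uses synthetic placeholder data rather than the paper's measurements. The helper name `fit_rate` is hypothetical, and the $p$-value computation is omitted.

```python
import numpy as np

def fit_rate(n_train, test_err):
    """OLS fit of log(err) = a + b * log(n); returns (slope b, intercept a).
    A slope near -1 corresponds to an O(1/n) rate, near -0.5 to O(1/sqrt(n))."""
    x = np.log(np.asarray(n_train, dtype=float))
    y = np.log(np.asarray(test_err, dtype=float))
    b, a = np.polyfit(x, y, deg=1)      # polyfit returns the highest-degree coefficient first
    return b, a

# Synthetic placeholder data (NOT the paper's results): err = C / n with mild noise.
rng = np.random.default_rng(0)
n_train = np.array([100, 200, 400, 800, 1600, 3200])
test_err = 5.0 / n_train * np.exp(0.05 * rng.standard_normal(n_train.size))

slope, intercept = fit_rate(n_train, test_err)
print(f"estimated exponent: {slope:.2f}")
```

Comparing the fitted slope with $-1$ and $-1/2$ shows whether the observed decay is closer to $\mathcal{O}(1/n)$ or to $\mathcal{O}(1/\sqrt{n})$.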

Theorems & Definitions (52)

  • Theorem 4.1
  • Proposition 4.2
  • Lemma 4.3
  • Lemma 4.4
  • Definition 4.5
  • Lemma 4.6
  • Theorem 4.7
  • Lemma 4.8
  • Theorem 4.9: Generalization
  • Lemma B.1: 2-Wasserstein continuity for functions of quadratic growth, Proposition 1 in polyanskiy2016wasserstein
  • ...and 42 more