On the Generalization Behavior of Deep Residual Networks From a Dynamical System Perspective

Jinshu Huang, Mingfei Sun, Chunlin Wu

TL;DR

This work establishes generalization error bounds for both discrete- and continuous-time residual networks (ResNets) by combining Rademacher complexity, flow maps of dynamical systems, and the convergence behavior of ResNets in the deep-layer limit.

Abstract

Deep neural networks (DNNs) have significantly advanced machine learning, with model depth playing a central role in their successes. The dynamical system modeling approach has recently emerged as a powerful framework, offering new mathematical insights into the structure and learning behavior of DNNs. In this work, we establish generalization error bounds for both discrete- and continuous-time residual networks (ResNets) by combining Rademacher complexity, flow maps of dynamical systems, and the convergence behavior of ResNets in the deep-layer limit. The resulting bounds are of order $O(1/\sqrt{S})$ with respect to the number of training samples $S$, and include a structure-dependent negative term, yielding depth-uniform and asymptotic generalization bounds under milder assumptions. These findings provide a unified understanding of generalization across both discrete- and continuous-time ResNets, helping to close the gap in both the order of sample complexity and assumptions between the discrete- and continuous-time settings.
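
The deep-layer limit mentioned in the abstract can be illustrated with a toy scalar example. The sketch below is not the paper's architecture: it assumes the commonly used forward-Euler reading of a residual block, $x_{l+1} = x_l + (T/L)\,f(x_l)$, with a hypothetical residual function $f(x) = \tanh(w x)$ and a hypothetical fixed weight $w = 0.5$. With the time horizon $T$ held fixed, the outputs for increasing depth $L$ should settle toward the solution of the ODE $\dot{x} = f(x)$ at time $T$.

```python
import math

def resnet_forward(x0, T, L, weight=0.5):
    """Forward pass of a toy scalar residual network.

    Assumed update (illustrative, not taken from the paper):
        x_{l+1} = x_l + (T / L) * tanh(weight * x_l)
    which is a forward-Euler step of dx/dt = tanh(weight * x) on [0, T].
    """
    h = T / L  # step size shrinks as depth L grows with T fixed
    x = x0
    for _ in range(L):
        x = x + h * math.tanh(weight * x)
    return x

def depth_gaps(x0=1.0, T=6.0, depths=(3, 6, 12, 24, 48)):
    """Absolute change in the network output between consecutive depths."""
    outputs = [resnet_forward(x0, T, L) for L in depths]
    return [abs(b - a) for a, b in zip(outputs, outputs[1:])]

# Each doubling of depth should change the output by less than the previous one.
gaps = depth_gaps()
for depth_pair, gap in zip(["3->6", "6->12", "12->24", "24->48"], gaps):
    print(f"L {depth_pair}: output change = {gap:.4f}")
```

For this particular choice of $f$, the continuous-time limit has the closed form $x(T) = 2\,\mathrm{asinh}\big(\sinh(x_0/2)\,e^{T/2}\big)$, which gives a convenient reference value when checking very deep toy networks.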

Paper Structure (16 sections, 6 theorems, 66 equations, 4 figures, 3 tables)

This paper contains 16 sections, 6 theorems, 66 equations, 4 figures, 3 tables.

Key Result

Proposition 3.2

Let $\mathcal{G}$ be a set of functions mapping $\mathbb{R}^N$ to $\mathbb{R}$. Consider a sample set $\mathcal{Z}=\left\{\mathrm{z}_{(1)}, \mathrm{z}_{(2)}, \ldots, \mathrm{z}_{(S)}\right\}$ drawn from some distribution $\mathfrak{B}$ and an activation function $\psi \in \mathscr{A}(\mathbb{R})$ of the form [...] where $\mathrm{Lip}_{\psi} = \max\{\mathrm{Lip}_{\phi_1}, \mathrm{Lip}_{\phi_2}\}$, [...]

Figures (4)

  • Figure 1: Average gap between testing and training loss over the last 10 epochs for ResNets with $(T, L) = (6, 6), (6,24)$ and $(T, L) = (8, 8), (8,32)$, plotted against the training sample size $S$, together with a least-squares fit of $h_{\mu}(S) = \mu / \sqrt{S}$. Plots (a1)-(a4) show results on MNIST, plots (b1)-(b4) on CIFAR10, and plots (c1)-(c4) on CIFAR100. The generalization gap decreases as $S$ increases, closely matching the theoretical rate of $O(1/\sqrt{S})$.
  • Figure 2: Training loss of ResNets with varying layer numbers in the deep-layer limit regime. The first row corresponds to $T=6$ with $L = 3, 6, 12, 24$, and the second row corresponds to $T=8$ with $L = 4, 8, 16, 32$. Across all datasets (MNIST, CIFAR10, CIFAR100), the training loss exhibits convergence behavior as the number of layers increases, supporting the depth-stability predicted by our theoretical analysis.
  • Figure 3: Top-1 image classification accuracy of ResNets with varying layer numbers in the deep-layer limit regime. The first row corresponds to $T=6$ with $L = 3, 6, 12, 24$, and the second row corresponds to $T=8$ with $L = 4, 8, 16, 32$. Across all datasets (MNIST, CIFAR10, CIFAR100), the testing accuracy exhibits convergence behavior as the number of layers increases, supporting the depth-stability predicted by our theoretical analysis.
  • Figure 4: Empirical generalization gap $\hat{\mathfrak{R}}_{\rm test}-\hat{\mathfrak{R}}_{\rm train}$ versus training epoch for ResNets with different $(T,L)$ configurations. The first row corresponds to MNIST, the second row to CIFAR10, and the third row to CIFAR100. ResNets equipped with activation functions that admit a structured decomposition exhibit smaller generalization gaps on CIFAR10 and CIFAR100 in the later stages of training.
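
The one-parameter fit $h_{\mu}(S) = \mu/\sqrt{S}$ named in the Figure 1 caption can be sketched as follows. The captions do not say how the fit was computed, so this assumes ordinary least squares in $\mu$, which has the closed form $\mu = \big(\sum_i g_i/\sqrt{S_i}\big) / \big(\sum_i 1/S_i\big)$ for observed gaps $g_i$ at sample sizes $S_i$. The function name `fit_mu` and all numbers are hypothetical; the data are synthetic and are not measurements from the paper.

```python
import math

def fit_mu(sample_sizes, gaps):
    """Least-squares estimate of mu in the model gap ~ mu / sqrt(S).

    Minimizing sum_i (g_i - mu / sqrt(S_i))^2 over mu gives
        mu = sum_i(g_i / sqrt(S_i)) / sum_i(1 / S_i).
    """
    numerator = sum(g / math.sqrt(s) for s, g in zip(sample_sizes, gaps))
    denominator = sum(1.0 / s for s in sample_sizes)
    return numerator / denominator

def residual_sum_of_squares(mu, sample_sizes, gaps):
    """Squared error of the fitted curve against the observed gaps."""
    return sum((g - mu / math.sqrt(s)) ** 2 for s, g in zip(sample_sizes, gaps))

# Synthetic, illustrative data generated exactly from mu / sqrt(S).
sizes = [1000, 4000, 16000, 64000]
true_mu = 2.0
observed_gaps = [true_mu / math.sqrt(s) for s in sizes]

mu_hat = fit_mu(sizes, observed_gaps)
rss = residual_sum_of_squares(mu_hat, sizes, observed_gaps)
print(f"fitted mu = {mu_hat:.6f}, residual sum of squares = {rss:.3e}")
```

On real measurements the residual will not vanish. A small residual relative to the size of the gaps is what "closely matching" the $O(1/\sqrt{S})$ rate would look like under this fitting choice.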

Theorems & Definitions (27)

  • Definition 3.1
  • Proposition 3.2
  • Proof 1
  • Remark 3.3
  • Example 3.1
  • Proof 2
  • Remark 3.4
  • Theorem 3.5
  • Proof 3
  • Remark 3.6
  • ...and 17 more