On the Generalization Behavior of Deep Residual Networks From a Dynamical System Perspective

Jinshu Huang; Mingfei Sun; Chunlin Wu

TL;DR

This work establishes generalization error bounds for both discrete- and continuous-time residual networks (ResNets) by combining Rademacher complexity, flow maps of dynamical systems, and the convergence behavior of ResNets in the deep-layer limit.

Abstract

Deep neural networks (DNNs) have significantly advanced machine learning, with model depth playing a central role in their successes. The dynamical system modeling approach has recently emerged as a powerful framework, offering new mathematical insights into the structure and learning behavior of DNNs. In this work, we establish generalization error bounds for both discrete- and continuous-time residual networks (ResNets) by combining Rademacher complexity, flow maps of dynamical systems, and the convergence behavior of ResNets in the deep-layer limit. The resulting bounds are of order $O(1/\sqrt{S})$ with respect to the number of training samples $S$, and include a structure-dependent negative term, yielding depth-uniform and asymptotic generalization bounds under milder assumptions. These findings provide a unified understanding of generalization across both discrete- and continuous-time ResNets, helping to close the gap in both the order of sample complexity and assumptions between the discrete- and continuous-time settings.
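The deep-layer limit mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common reading of a ResNet as a forward-Euler discretization of an ODE, with terminal time $T$, $L$ layers, and step size $T/L$. That parameterization, the `tanh` residual branch, the shared weight matrix, and the function name `resnet_forward` are all illustrative assumptions, not the paper's exact model.

```python
import numpy as np

def resnet_forward(x, T, L, weight):
    """Toy residual network: x_{l+1} = x_l + (T/L) * tanh(W x_l).

    Assumption: every layer shares the same weight matrix, so the network
    is a forward-Euler discretization of dx/dt = tanh(W x) on [0, T].
    """
    h = T / L  # step size shrinks as the number of layers grows
    for _ in range(L):
        x = x + h * np.tanh(weight @ x)
    return x

# Hypothetical weights and input, chosen only for illustration.
W = np.array([[-0.5, 0.2],
              [-0.2, -0.5]])
x0 = np.array([1.0, -0.5])

# A very deep network stands in for the continuous-time limit.
reference = resnet_forward(x0, T=6.0, L=4096, weight=W)

for L in (3, 6, 12, 24):
    out = resnet_forward(x0, T=6.0, L=L, weight=W)
    print(L, np.linalg.norm(out - reference))
```

For fixed $T$, the printed distance to the deep reference shows how the discrete-time outputs approach a continuous-time flow map as $L$ increases, under the assumptions above.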

Paper Structure (16 sections, 6 theorems, 66 equations, 4 figures, 3 tables)

Introduction
Notations and problem formulation
Notations
Residual neural network and its dynamical system modeling
The problem of generalization for discrete- and continuous-time ResNets
Main results and application
Proofs
Proof of the proposition on the Rademacher complexity property of the activation function
Proof of the theorem on the uniform generalization error bound for discrete-time ResNets
Proof of the theorem on the uniform generalization error bound for continuous-time ResNets
Experiments
A test on the influence of training sample size $S$
A test on the layer number of ResNets
Influence of activation functions with structured decomposition
Conclusion
...and 1 more section

Key Result

Proposition 3.2

Let $\mathcal{G}$ be a set of functions mapping $\mathbb{R}^N$ to $\mathbb{R}$. Consider a sample set $\mathcal{Z}=\left\{\mathrm{z}_{(1)}, \mathrm{z}_{(2)}, \cdots, \mathrm{z}_{(S)}\right\}$ drawn from some distribution $\mathfrak{B}$ and an activation function $\psi \in \mathscr{A}(\mathbb{R})$ of the form [...], where $\mathrm{Lip}_{\psi} = \max\{\mathrm{Lip}_{\phi_1}, \mathrm{Lip}_{\phi_2}\}$ [...]

Figures (4)

Figure 1: Average testing-training loss gap of ResNets with $(T, L) = (6, 6), (6, 24)$ and $(T, L) = (8, 8), (8, 32)$ over the last 10 epochs versus training sample size $S$, along with a least-squares fit using $h_{\mu}(S) = \mu / \sqrt{S}$. Plots (a1)-(a4) show results on MNIST, plots (b1)-(b4) on CIFAR10, and plots (c1)-(c4) on CIFAR100. The generalization gap decreases with increasing $S$, closely matching the theoretical rate of $O(1/\sqrt{S})$.
Figure 2: Training loss of ResNets with varying layer numbers in the deep-layer limit regime. The first row corresponds to $T=6$ with $L = 3, 6, 12, 24$, and the second row corresponds to $T=8$ with $L = 4, 8, 16, 32$. Across all datasets (MNIST, CIFAR10, CIFAR100), the training loss exhibits convergence behavior as the number of layers increases, supporting the depth-stability predicted by our theoretical analysis.
Figure 3: Top-1 image classification accuracy of ResNets with varying layer numbers in the deep-layer limit regime. The first row corresponds to $T=6$ with $L = 3, 6, 12, 24$, and the second row corresponds to $T=8$ with $L = 4, 8, 16, 32$. Across all datasets (MNIST, CIFAR10, CIFAR100), the testing accuracy exhibits convergence behavior as the number of layers increases, supporting the depth-stability predicted by our theoretical analysis.
Figure 4: Empirical generalization gap $\hat{\mathfrak{R}}_{\rm test}-\hat{\mathfrak{R}}_{\rm train}$ versus training epoch for ResNets with different $(T,L)$ configurations. The first row corresponds to MNIST, the second row to CIFAR10, and the third row to CIFAR100. ResNets equipped with activation functions admitting a structured decomposition exhibit smaller generalization gaps on CIFAR10 and CIFAR100 in the later stage of training.
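
The least-squares fit with $h_{\mu}(S) = \mu / \sqrt{S}$ mentioned in the caption of Figure 1 has a closed form, because the model is linear in $\mu$: minimizing $\sum_i (g_i - \mu / \sqrt{S_i})^2$ gives $\mu = \sum_i g_i S_i^{-1/2} / \sum_i S_i^{-1}$. The sketch below shows one way such a fit could be computed. The function name `fit_mu` and the data are hypothetical (synthetic values, not results from the paper), and the authors' actual fitting procedure may differ.

```python
import numpy as np

def fit_mu(sample_sizes, gaps):
    """Closed-form least-squares fit of h_mu(S) = mu / sqrt(S) to observed gaps."""
    s = np.asarray(sample_sizes, dtype=float)
    g = np.asarray(gaps, dtype=float)
    basis = 1.0 / np.sqrt(s)  # the single regressor 1 / sqrt(S)
    return float(basis @ g / (basis @ basis))

# Synthetic, illustrative data only: gaps generated exactly from mu = 3.
sizes = np.array([1000, 2000, 4000, 8000, 16000])
gaps = 3.0 / np.sqrt(sizes)

mu_hat = fit_mu(sizes, gaps)
print(mu_hat)
```

With measured gaps in place of the synthetic ones, plotting `mu_hat / np.sqrt(sizes)` against the data would give the kind of visual check of the $O(1/\sqrt{S})$ rate that the caption describes.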

Theorems & Definitions (27)

Definition 3.1
Proposition 3.2
Proof 1
Remark 3.3
Example 3.1
Proof 2
Remark 3.4
Theorem 3.5
Proof 3
Remark 3.6
...and 17 more
