Implicit regularization of deep residual networks towards neural ODEs

Pierre Marion; Yu-Han Wu; Michael E. Sander; Gérard Biau

TL;DR

This work establishes that deep residual networks (ResNets) trained with gradient flow are implicitly regularized toward neural ODEs. Using a depth-appropriate scaling and initialization, the authors prove that, for any finite training time, ResNets converge in the large-depth limit to a neural ODE with time-dependent kernels. Under a Polyak–Łojasiewicz condition, which holds with an overparameterization in width that is only linear, they also obtain long-time convergence to a global minimum, together with a double limit in which both depth and training time tend to infinity and the limiting neural ODE interpolates the data. Generalizations to broader architectures and initialization schemes are discussed. Numerical experiments on synthetic and real data show that a neural-ODE structure emerges when the weights are initialized smoothly across layers, for example via weight tying. The results give a mathematical link between discrete ResNets and continuous-depth models, with implications for understanding implicit regularization, generalization, and memory-efficient training, and they identify the regimes in which this correspondence holds.
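To make the role of the Polyak–Łojasiewicz (PL) condition concrete, the following is the textbook argument for why such a condition implies convergence of gradient flow to a global minimum. It is a sketch in the standard global form: the symbols $\theta$, $\mu$ and $\ell^\star$ are notation chosen here, and the paper's own condition may be local, use different constants, and apply to a rescaled or clipped flow.

```latex
% Standard (global) PL condition with constant \mu > 0, for a differentiable
% loss \ell with infimum \ell^\star (notation assumed for this sketch):
\|\nabla \ell(\theta)\|^2 \;\geqslant\; 2\mu \bigl(\ell(\theta) - \ell^\star\bigr).

% Along the gradient flow \dot\theta(t) = -\nabla \ell(\theta(t)), the chain rule gives
\frac{d}{dt}\bigl(\ell(\theta(t)) - \ell^\star\bigr)
  = -\|\nabla \ell(\theta(t))\|^2
  \;\leqslant\; -2\mu \bigl(\ell(\theta(t)) - \ell^\star\bigr),

% so Gr\"onwall's inequality yields exponential decay of the excess loss:
\ell(\theta(t)) - \ell^\star \;\leqslant\; e^{-2\mu t} \bigl(\ell(\theta(0)) - \ell^\star\bigr).
```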

Abstract

Residual neural networks are state-of-the-art deep learning models. Their continuous-depth analog, neural ordinary differential equations (ODEs), are also widely used. Despite their success, the link between the discrete and continuous models still lacks a solid mathematical foundation. In this article, we take a step in this direction by establishing an implicit regularization of deep residual networks towards neural ODEs, for nonlinear networks trained with gradient flow. We prove that if the network is initialized as a discretization of a neural ODE, then such a discretization holds throughout training. Our results are valid for a finite training time, and also as the training time tends to infinity provided that the network satisfies a Polyak–Łojasiewicz condition. Importantly, this condition holds for a family of residual networks where the residuals are two-layer perceptrons with an overparameterization in width that is only linear, and implies the convergence of gradient flow to a global minimum. Numerical experiments illustrate our results.
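As an illustration of what "initialized as a discretization of a neural ODE" can mean, here is a minimal sketch in plain Python. It assumes a $1/L$ residual scaling and weights sampled from a smooth function of depth. The residual `tanh(Z h)`, the rotation-valued weight function `smooth_weights`, and all function names are hypothetical choices for this example, not the architecture analyzed in the paper.

```python
import math

def smooth_weights(s):
    """Hypothetical smooth weight function Z(s) on [0, 1]: a 2x2 rotation by angle pi*s."""
    c, sn = math.cos(math.pi * s), math.sin(math.pi * s)
    return [[c, -sn], [sn, c]]

def residual(h, Z):
    """Toy residual f(h, Z) = tanh(Z h), with tanh applied componentwise."""
    return [math.tanh(sum(Z[i][j] * h[j] for j in range(len(h))))
            for i in range(len(Z))]

def resnet_forward(h0, L):
    """Depth-L ResNet: h_k = h_{k-1} + (1/L) f(h_{k-1}, Z_k), with Z_k = Z(k/L).

    This is an explicit Euler scheme with step 1/L for dH/ds = f(H(s), Z(s)).
    """
    h = list(h0)
    for k in range(1, L + 1):
        f = residual(h, smooth_weights(k / L))
        h = [hi + fi / L for hi, fi in zip(h, f)]
    return h

def depth_gap(h0, L):
    """Max-abs difference between the network outputs at depth L and depth 2L."""
    a, b = resnet_forward(h0, L), resnet_forward(h0, 2 * L)
    return max(abs(x - y) for x, y in zip(a, b))
```

Calling `depth_gap([1.0, 0.5], L)` for increasing `L` should show the outputs stabilizing as depth grows, which is the behavior expected of a consistent discretization of an ODE. The sketch only covers the forward pass at initialization and says nothing about training.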

Paper Structure (58 sections, 21 theorems, 186 equations, 4 figures, 1 table)

Introduction
  Contributions.
Related work
  Deep residual networks and neural ODEs.
  Long-time convergence of wide residual networks.
  Implicit regularization.
Definitions and notation
  Residual network.
  Data and loss.
  Initialization.
  Training algorithm.
  Neural ODE.
Large-depth limit of residual networks
  Clipped gradient flow and finite training time
  Convergence in the long-time limit for wide networks
...and 43 more sections

Key Result

Proposition 1

The (clipped) gradient flow has a unique solution for all $t \geqslant 0$.
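The defining equation of the clipped gradient flow is not reproduced on this page, so the following is only a toy sketch of the general idea. It assumes the flow has the form $\dot z = \pi(-\nabla \ell(z))$ with $\pi$ a componentwise clipping to $[-c, c]$, integrated with explicit Euler steps on a one-dimensional quadratic loss. The clipping map, the loss, and the function names are all assumptions made for illustration. Intuitively, a bounded right-hand side keeps the trajectory from moving faster than speed $c$, which rules out finite-time blow-up.

```python
def clip(x, c):
    """Clip a scalar to the interval [-c, c] (assumed form of the clipping map)."""
    return max(-c, min(c, x))

def clipped_gradient_flow(z0, grad, c, dt, steps):
    """Explicit Euler discretization of dz/dt = clip(-grad(z), c).

    Returns the list of iterates, starting from z0.
    """
    z = z0
    path = [z]
    for _ in range(steps):
        z = z + dt * clip(-grad(z), c)
        path.append(z)
    return path

# Toy usage: loss l(z) = 0.5 * (z - 3)^2, whose gradient is z - 3.
path = clipped_gradient_flow(z0=-10.0, grad=lambda z: z - 3.0,
                             c=1.0, dt=0.01, steps=2000)
```

In this toy run the iterate moves at the capped speed `c` while it is far from the minimizer, then behaves like ordinary gradient flow once the gradient falls below the clipping threshold.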

Figures (4)

Figure 1: Left: $1/L$ convergence of the maximum distance between two successive weight matrices $\max_{1 \leqslant k \leqslant L,\, t \in [0,T]}(\|Z^L_k(t) - Z^L_{k+1}(t)\|_F)$. Right: uniform convergence of $\mathcal{Z}^L$ to its large-depth limit $\mathcal{Z}$. Here, for a matrix-valued function $f$, $\|f\|$ denotes $(\int_0^1 \|f(s)\|^2_F \, ds)^{1/2}$.
Figure 2: Left: Randomly chosen entry of the weight matrices across layers ($x$-axis) for various training times $t$ (lighter color indicates longer training time). Right: Loss against training time.
Figure 3: Random entries of the convolutions across layers ($x$-axis) after training. Left: Weight-tied initialization leads to smooth weights. Right: i.i.d. initialization leads to non-smooth weights.
Figure 4: Average (across layers) of the Frobenius norm of the difference between two successive weights in the convolutional ResNets after training on CIFAR-10, depending on the initialization strategy.
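The smoothness measures in Figures 1 and 4 are built from Frobenius distances between successive weight matrices. The sketch below computes the maximum and the average of these distances for a list of matrices at a single fixed training time. The weights are sampled from a hypothetical smooth function (2x2 rotations by angle $\pi k / L$), chosen only because the expected $1/L$ decay can be checked by hand. The function names are illustrative, and the maximum runs over the $L - 1$ available successive pairs.

```python
import math

def frobenius_distance(A, B):
    """Frobenius norm of A - B for two matrices given as nested lists."""
    return math.sqrt(sum((a - b) ** 2
                         for row_a, row_b in zip(A, B)
                         for a, b in zip(row_a, row_b)))

def successive_distances(weights):
    """Distances ||Z_k - Z_{k+1}||_F over all successive pairs in the list."""
    return [frobenius_distance(weights[k], weights[k + 1])
            for k in range(len(weights) - 1)]

def max_successive_distance(weights):
    """Maximum successive distance (the kind of quantity shown in Figure 1, left)."""
    return max(successive_distances(weights))

def mean_successive_distance(weights):
    """Average successive distance across layers (the kind of quantity in Figure 4)."""
    d = successive_distances(weights)
    return sum(d) / len(d)

def rotation(theta):
    """2x2 rotation matrix by angle theta."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def sampled_weights(L):
    """Hypothetical smooth weights Z_k = Z(k/L), k = 1..L, with Z(s) a rotation by pi*s."""
    return [rotation(math.pi * k / L) for k in range(1, L + 1)]
```

For these rotation weights each successive distance equals $2\sqrt{2}\,\sin(\pi / (2L))$, so doubling the depth roughly halves `max_successive_distance(sampled_weights(L))`. Weights drawn independently per layer would not be expected to show this decay.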

Theorems & Definitions (36)

Proposition 1
Proposition 2
Proposition 3
Theorem 4
Definition 1
Proposition 5
Theorem 6
Proposition 7
proof
Remark 1
...and 26 more
