Implicit regularization of deep residual networks towards neural ODEs
Pierre Marion, Yu-Han Wu, Michael E. Sander, Gérard Biau
TL;DR
This work shows that deep residual networks trained with gradient flow are implicitly regularized toward neural ODEs. With depth-appropriate scaling and initialization, the authors prove a large-depth limit at finite training time in which ResNets converge to a neural ODE with time-dependent kernels. Under a Polyak–Łojasiewicz condition, which holds for residuals with width overparameterization that is only linear, they obtain long-time convergence to a global minimum, together with a double limit, as both depth and training time tend to infinity, in which the network converges to a neural ODE that interpolates the training data. Generalizations to broader architectures and initialization schemes are discussed. Numerical experiments on synthetic and real data show neural-ODE structure emerging when the weights are smooth and initialized via weight tying. The results give a mathematical link between discrete ResNets and continuous-depth models, with implications for understanding implicit regularization, generalization, and memory-efficient training, and they identify the regimes in which this correspondence holds.
Abstract
Residual neural networks are state-of-the-art deep learning models. Their continuous-depth analogs, neural ordinary differential equations (ODEs), are also widely used. Despite their success, the link between the discrete and continuous models still lacks a solid mathematical foundation. In this article, we take a step in this direction by establishing an implicit regularization of deep residual networks towards neural ODEs, for nonlinear networks trained with gradient flow. We prove that if the network is initialized as a discretization of a neural ODE, then such a discretization holds throughout training. Our results are valid for a finite training time, and also as the training time tends to infinity provided that the network satisfies a Polyak–Łojasiewicz condition. Importantly, this condition holds for a family of residual networks where the residuals are two-layer perceptrons with an overparameterization in width that is only linear, and implies the convergence of gradient flow to a global minimum. Numerical experiments illustrate our results.
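As a rough illustration of what "a discretization of a neural ODE" can mean, the sketch below builds an untrained residual network whose layer weights are samples of smooth functions of the depth variable s = k/L, with a 1/L factor on each residual branch. Its forward pass is then an explicit Euler scheme, and its output stabilizes as depth grows. Everything specific here is an assumption made for the example rather than the paper's exact setup: the tanh activation, the hidden dimension, the particular weight functions, and the helper names weights_at and resnet_forward. A very deep network stands in for the ODE flow as a reference. The sketch only covers the forward pass at initialization and does not simulate training.

```python
import numpy as np

# Assumed setup for illustration only: hidden dimension and base matrices.
D = 4
_rng = np.random.default_rng(0)
V0, V1, W0, W1 = (_rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(4))


def weights_at(s):
    """Hypothetical smooth weight functions s -> (V(s), W(s)) for s in [0, 1]."""
    V = np.cos(np.pi * s) * V0 + np.sin(np.pi * s) * V1
    W = (1.0 - s) * W0 + s * W1
    return V, W


def resnet_forward(x, depth):
    """Residual network h_{k+1} = h_k + (1/depth) * V_k tanh(W_k h_k).

    Layer k uses the weight functions sampled at s = k / depth, so the
    forward pass is an explicit Euler discretization of
    dH/ds = V(s) tanh(W(s) H(s)) on [0, 1].
    """
    h = np.array(x, dtype=float)
    for k in range(depth):
        V, W = weights_at(k / depth)
        h = h + (1.0 / depth) * (V @ np.tanh(W @ h))
    return h


x0 = np.ones(D)
# A very deep network serves as a proxy for the continuous-depth flow.
reference = resnet_forward(x0, 20000)
errors = {
    L: float(np.linalg.norm(resnet_forward(x0, L) - reference))
    for L in (10, 100, 1000)
}
for L, err in errors.items():
    print(f"depth={L:5d}  distance to reference={err:.2e}")
```

The 1/L scaling matters in this sketch: without it, the accumulated residual updates would not correspond to a fixed integration horizon as depth increases, and the outputs at different depths would not be comparable.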
