Differential Equation Scaling Limits of Shaped and Unshaped Neural Networks

Mufan Bill Li; Mihai Nica

TL;DR

This work studies unshaped neural networks through differential-equation scaling limits and establishes two results at initialization: (i) a fully connected ResNet with a $d^{-1/2}$ factor on the residual branch converges to the same infinite-depth-and-width limit as an MLP with depth $d \ll$ width $n$ and ReLU activation shaped at rate $d^{-1/2}$, and (ii) the first-order correction to the layerwise correlation of an unshaped ReLU MLP converges to an SDE with a singularity at the initial time. The first result ties shaped activations to residual architectures, since both yield the same limiting covariance dynamics. The second shows that the unshaped regime follows distinct, singular stochastic dynamics that govern how the layerwise correlation evolves. Together, the results link the shaping and normalization perspectives, offer a starting point for analyzing training dynamics in infinite-depth regimes, and suggest that weakening the nonlinearity, whether via shaping or residual scaling, may stabilize training and enable feature learning without normalization.

Abstract

Recent analyses of neural networks with shaped activations (i.e., the activation function is scaled as the network size grows) have led to scaling limits described by differential equations. However, these results do not a priori tell us anything about "ordinary" unshaped networks, where the activation is unchanged as the network size grows. In this article, we find a similar differential-equation-based asymptotic characterization for two types of unshaped networks. First, we show that the following two architectures converge to the same infinite-depth-and-width limit at initialization: (i) a fully connected ResNet with a $d^{-1/2}$ factor on the residual branch, where $d$ is the network depth; and (ii) a multilayer perceptron (MLP) with depth $d \ll$ width $n$ and shaped ReLU activation at rate $d^{-1/2}$. Second, for an unshaped MLP at initialization, we derive the first-order asymptotic correction to the layerwise correlation. In particular, if $\rho_\ell$ is the correlation at layer $\ell$, then $q_t = \ell^2 (1 - \rho_\ell)$ with $t = \frac{\ell}{n}$ converges to an SDE with a singularity at $t=0$. Together, these results provide a connection between shaped and unshaped network architectures and open up the possibility of studying the effect of normalization methods and how they connect with shaping activation functions.
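To make the two architectures in the abstract concrete, here is a minimal NumPy sketch of one layer of each. The exact parameterization is an assumption and is not given in this summary: the shaped ReLU is taken to be a leaky-ReLU-style function with slopes $1 + c_\pm / \sqrt{d}$, the residual layer is taken to be $z + (dn)^{-1/2} W \max(z, 0)$ with i.i.d. standard Gaussian weights, and any variance-normalizing constant is omitted. The names `shaped_relu`, `resnet_layer`, `shaped_mlp_layer`, `forward`, `c_plus`, and `c_minus` are hypothetical.

```python
import numpy as np


def shaped_relu(x, depth, c_plus=0.0, c_minus=-1.0):
    """Assumed shaped ReLU: slopes 1 + c/sqrt(depth) on each side of zero.

    As depth grows, both slopes tend to 1 and the activation approaches
    the identity, i.e. the nonlinearity is weakened.
    """
    s_plus = 1.0 + c_plus / np.sqrt(depth)
    s_minus = 1.0 + c_minus / np.sqrt(depth)
    return s_plus * np.maximum(x, 0.0) + s_minus * np.minimum(x, 0.0)


def resnet_layer(z, W, depth):
    """Assumed residual layer: plain ReLU branch scaled by depth**-0.5."""
    n = z.shape[0]
    return z + (W @ np.maximum(z, 0.0)) / np.sqrt(depth * n)


def shaped_mlp_layer(z, W, depth):
    """Assumed MLP layer with the shaped ReLU (normalizing constant omitted)."""
    n = z.shape[0]
    return (W @ shaped_relu(z, depth)) / np.sqrt(n)


def forward(layer_fn, z0, depth, seed=0):
    """Apply `depth` layers with fresh i.i.d. standard Gaussian weights."""
    rng = np.random.default_rng(seed)
    z = np.array(z0, dtype=float)
    n = z.shape[0]
    for _ in range(depth):
        W = rng.standard_normal((n, n))
        z = layer_fn(z, W, depth)
    return z
```

In both layers the depth-dependent factor makes each layer a small perturbation of a linear map, which is the informal reason the two architectures can share a limit. The sketch does not attempt to match constants between them.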

Paper Structure (12 sections, 14 theorems, 83 equations, 1 figure, 1 table)

Introduction
Related Work
Background on Shaped Networks and ResNets
Shaped Limit of Neural Networks
ODE Limit of Residual Networks
An Alternative Shaped Limit for $p \in (0, \frac{1}{2})$
Precise Results
An SDE for the Unshaped ReLU MLP
Full Derivation
Discussion
Background on Markov Chain Convergence to SDEs
Technical Lemmas for Shaped Activations

Key Result

Theorem 2.1

Let $p=\frac{1}{2}$. Then in the limit as $d, n \to \infty$ with $\frac{d}{n} \to T > 0$, and with $\varphi_s$ defined as above, the upper triangular entries of $V_{\lfloor tn \rfloor}$ (flattened to a vector) converge weakly to the following SDE, where $\Sigma(V)|_{\alpha\beta,\gamma\delta} = V^{\alpha\gamma} V^{\beta\delta} + V^{\alpha\delta} V^{\beta\gamma}$, and if $\varphi$ is a ReLU-like act
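As an illustration of the coefficient $\Sigma(V)$ defined in the theorem above, the following Python sketch tabulates $\Sigma(V)|_{\alpha\beta,\gamma\delta} = V^{\alpha\gamma} V^{\beta\delta} + V^{\alpha\delta} V^{\beta\gamma}$ over the upper triangular index pairs of a covariance matrix $V$. The function name `diffusion_covariance` and the lexicographic ordering of the pairs $\alpha \le \beta$ are assumptions made for illustration. The summary does not state how the entries are flattened.

```python
import numpy as np
from itertools import combinations_with_replacement


def diffusion_covariance(V):
    """Tabulate Sigma(V) over upper triangular index pairs (a <= b).

    Returns the list of index pairs (assumed lexicographic order) and the
    matrix with entries V[a, c] * V[b, d] + V[a, d] * V[b, c].
    """
    V = np.asarray(V, dtype=float)
    m = V.shape[0]
    pairs = list(combinations_with_replacement(range(m), 2))
    Sigma = np.empty((len(pairs), len(pairs)))
    for i, (a, b) in enumerate(pairs):
        for j, (c, d) in enumerate(pairs):
            Sigma[i, j] = V[a, c] * V[b, d] + V[a, d] * V[b, c]
    return pairs, Sigma


# Two inputs with unit variance and covariance 0.5.
pairs, Sigma = diffusion_covariance([[1.0, 0.5], [0.5, 1.0]])
print(pairs)
print(Sigma)
```

For a single input ($V$ is $1 \times 1$ with entry $v$), the formula reduces to $\Sigma = 2v^2$.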

Figures (1)

Figure 1: Empirical distribution of the transformed correlation $r_t = \log( \ell^2( 1 - \rho_\ell ) )$ for an unshaped ReLU MLP, SDE sample density computed via kernel density estimation. Simulated with $n = d = 150, \rho_0 = 0.3, r_0 = \log(1 - \rho_0) = \log(0.7)$, SDE step size $10^{-2}$, and $2^{13}$ samples.
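The empirical side of Figure 1 can be approximated with a short simulation. The sketch below propagates two inputs with initial correlation $\rho_0$ through an unshaped ReLU MLP at initialization and records $r = \log(\ell^2 (1 - \rho_\ell))$ at each layer. Several details are assumptions not stated in this summary: weights are i.i.d. Gaussian with variance $2/n$, the correlation is the cosine similarity of the two hidden vectors before the activation, and the layer index starts at $\ell = 1$ so that the first value equals $\log(1 - \rho_0)$, as in the caption. The names `make_pair` and `log_rescaled_correlation` are hypothetical.

```python
import numpy as np


def make_pair(n, rho0, rng):
    """Return two unit-norm vectors in R^n with cosine similarity rho0."""
    u = rng.standard_normal(n)
    v = rng.standard_normal(n)
    u /= np.linalg.norm(u)
    v -= (v @ u) * u  # make v orthogonal to u
    v /= np.linalg.norm(v)
    return u, rho0 * u + np.sqrt(1.0 - rho0**2) * v


def log_rescaled_correlation(n=150, depth=150, rho0=0.3, seed=0):
    """Simulate r_ell = log(ell^2 * (1 - rho_ell)) for ell = 1, ..., depth."""
    rng = np.random.default_rng(seed)
    xa, xb = make_pair(n, rho0, rng)
    H = np.stack([xa, xb], axis=1)  # shape (n, 2): one column per input
    r = np.empty(depth)
    for ell in range(1, depth + 1):
        a, b = H[:, 0], H[:, 1]
        rho = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
        # Guard against rounding pushing rho to 1 or slightly above.
        gap = max(1.0 - rho, np.finfo(float).tiny)
        r[ell - 1] = np.log(ell**2 * gap)
        # Assumed initialization: i.i.d. N(0, 2/n) weights, plain ReLU.
        W = rng.standard_normal((n, n)) * np.sqrt(2.0 / n)
        H = W @ np.maximum(H, 0.0)
    return r
```

Collecting the final entry of `log_rescaled_correlation` over many seeds (the caption uses $2^{13}$ samples) gives an empirical distribution that can be compared against a density estimate of the SDE samples.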

Theorems & Definitions (23)

Theorem 2.1: Theorem 3.2 and 3.9 of li2022neural, Informal
Theorem 2.2: Theorem 2 of hayou2023width, Informal
Remark 3.1
Lemma 3.2: Covariance Markov Chain for the Shaped MLP
proof
Remark 3.3
Proposition 3.4: Covariance ODE for the Shaped ReLU MLP
proof
Theorem 4.1: Rescaled Correlation
proof: Proof of Theorem 4.1
...and 13 more
