Field theory for optimal signal propagation in ResNets

Kirsten Fischer; David Dahmen; Moritz Helias

TL;DR

We study how finite-width residual networks propagate signals and how to choose the residual scaling to maximize that propagation. We develop a field-theoretic description of the Bayesian network prior, recover the Neural Network Gaussian Process (NNGP) in the infinite-width limit, and obtain a next-to-leading-order correction, the response function, that quantifies the network's sensitivity to input variations. The response function has a distinct maximum as a function of the residual scaling $\rho$; the optimal value $\rho^*$ depends mainly on depth, as $\rho^* \sim 1/\sqrt{L}$, and only weakly on other hyperparameters, which explains its universality across network hyperparameters. For trained networks, experiments using Langevin dynamics indicate that operating at or near $\rho^*$ enhances data adaptation and generalization, consistent with signal-propagation driven training. The framework offers a systematic finite-size theory for ResNets and can be extended to broader architectures.

Abstract

Residual networks have significantly better trainability and thus performance than feed-forward networks at large depth. Introducing skip connections facilitates signal propagation to deeper layers. In addition, previous works found that adding a scaling parameter for the residual branch further improves generalization performance. While they empirically identified a particularly beneficial range of values for this scaling parameter, the associated performance improvement and its universality across network hyperparameters remain to be understood. For feed-forward networks, finite-size theories have led to important insights with regard to signal propagation and hyperparameter tuning. Here we derive a systematic finite-size field theory for residual networks to study signal propagation and its dependence on the scaling of the residual branch. We derive analytical expressions for the response function, a measure of the network's sensitivity to inputs, and show that for deep networks the empirically found values of the scaling parameter lie within the range of maximal sensitivity. Furthermore, we obtain an analytical expression for the optimal scaling parameter that depends only weakly on other network hyperparameters, such as the weight variance, thereby explaining its universality across hyperparameters. Overall, this work provides a theoretical framework to study ResNets at finite size.
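The dependence of the output response on the residual scaling can be illustrated with a minimal numerical sketch. It is not the paper's finite-size field theory. It assumes a simplified infinite-width variance recursion, $K^{(l+1)}=K^{(l)}+\rho^{2}\left(\sigma_{w}^{2}\,\langle\phi(h)^{2}\rangle_{h\sim\mathcal{N}(0,K^{(l)})}+\sigma_{b}^{2}\right)$, with $\phi=\text{erf}$, for which the Gaussian average has the known closed form $\frac{2}{\pi}\arcsin\frac{2K}{1+2K}$, followed by a readout through $\phi$. The function names, the recursion itself, and the default parameter values (borrowed from the Figure 4 caption) are assumptions made for illustration.

```python
import numpy as np


def erf_second_moment(k):
    """E[erf(h)^2] for h ~ N(0, k); standard closed form for erf."""
    return (2.0 / np.pi) * np.arcsin(2.0 * k / (1.0 + 2.0 * k))


def erf_second_moment_grad(k):
    """Derivative of erf_second_moment with respect to the variance k."""
    return 4.0 / (np.pi * (1.0 + 2.0 * k) * np.sqrt(1.0 + 4.0 * k))


def output_response(rho, depth, k0=0.05, sigma_w2=1.25, sigma_b2=0.05):
    """d K_out / d K_0 for the ASSUMED simplified recursion (chain rule)."""
    k = k0
    dk_dk0 = 1.0
    for _ in range(depth):
        # Jacobian of one residual layer, evaluated at the current variance.
        dk_dk0 *= 1.0 + rho**2 * sigma_w2 * erf_second_moment_grad(k)
        # Skip connection keeps k; the residual branch adds rho^2 * (...).
        k = k + rho**2 * (sigma_w2 * erf_second_moment(k) + sigma_b2)
    # Assumed readout: K_out = sigma_w2 * E[erf(h)^2] + sigma_b2.
    return sigma_w2 * erf_second_moment_grad(k) * dk_dk0


def optimal_rho(depth, grid=None):
    """Grid search for the residual scaling with the largest output response."""
    if grid is None:
        grid = np.logspace(-2, 0, 400)
    responses = np.array([output_response(r, depth) for r in grid])
    return float(grid[np.argmax(responses)])


if __name__ == "__main__":
    for depth in (10, 50, 100, 200):
        rho_star = optimal_rho(depth)
        print(depth, rho_star, rho_star * np.sqrt(depth))
```

In this toy model the response depends on $\rho$ and $L$ mainly through the product $L\rho^{2}$, so the printed value of $\rho^{\ast}\sqrt{L}$ should stay roughly constant across depths, in line with the $1/\sqrt{L}$ scaling stated above.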

Paper Structure (20 sections, 103 equations, 9 figures)

Introduction
Field Theory of Residual Networks
Network prior in field-theoretic framework
Saddle point approximation yields NNGP
Next-to-leading-order correction yields response function
Relation to linear response theory
Signal propagation and optimal scaling in residual networks
Optimal scaling of the residual branch
Depth scaling dominates optimal scaling
Behavior across full data set
Behavior in trained networks
Discussion
Network prior for multiple inputs
Saddle point approximation
Next-to-leading-order Corrections
...and 5 more sections

Figures (9)

Figure 1: Signal distribution in residual network. (a) Network layer with residual branch and skip connection. The residual branch returns $h\mapsto\mathcal{F}(h)$, the layer passes on $\mathcal{F}(h)+h$ to the next layer. (b) Distribution of the signal $h^{(l)}$ after layer $l$ (solid curves) relative to the dynamic range $\mathcal{V}$ (shaded orange area) of the activation function $\phi=\text{erf}$ (dashed curve). The signal is Gaussian distributed $h^{(l)}\sim\mathcal{N}(0,K^{(l)})$ with variance given by $K^{(l)}$, which depends on the residual scaling parameter $\rho$. For values larger than the optimal scaling $\rho>\rho^{\ast}$, part of the signal is lost in the saturation of the activation function $\phi$ (dark blue). For values smaller than the optimal scaling $\rho<\rho^{\ast}$, the signal is restricted to a small fraction of the dynamic range (light blue) in which the activation function is typically linear. For optimal scaling $\rho=\rho^{\ast}$, the signal optimally utilizes the whole dynamic range $\mathcal{V}$ of the activation function $\phi$ (blue). (c) The response function $\chi^{(l)}$ describes how the variance $K^{(l)}$, corresponding to the diagonal element of the GP kernel, changes to linear order in the perturbation of the input kernel $\delta K^{(0)}$ around its data mean $\langle K^{(0)}\rangle$. The kernel $K^{(l)}$ of the signal distribution can only increase across layers due to the skip connections; its rate of increase is governed by the residual scaling parameter $\rho$. If the signal goes into saturation ($\rho>\rho^{\ast}$) or remains close to zero ($\rho<\rho^{\ast}$), then the overall response of the network output to a change of the input kernel is limited. (d) The output response $\chi^{\text{out}}$ as a function of the residual scaling $\rho$ exhibits a unique maximum that depends on the network depth $L$, yielding a scaling $\rho^{\ast}(L)$ that promotes optimal signal propagation in the network.
Figure 2: Residual kernels $C_{*}^{(l)}$ (a) and the respective response function $\eta^{(l)}$ (b) in ResNets (blue) compared to FFNets (green). In (a), error bars indicate the standard error of the mean obtained from simulations over $10^{3}$ network initializations; solid curves show theory values from \ref{eq:C_l}. In (b), dots represent simulations over $10^{2}$ input samples and $10^{3}$ network initializations; solid curves show theory values from \ref{eq:eta}. Errors are of order $10^{-5}$ and therefore not shown. ResNets exhibit a slower decay over layers $l$ compared to FFNets. Other parameters: $\sigma_{w,\,\text{in}}^{2}=\sigma_{w}^{2}=\sigma_{w,\,\text{out}}^{2}=1.2$, $\sigma_{b,\,\text{in}}^{2}=\sigma_{b}^{2}=\sigma_{b,\,\text{out}}^{2}=0.2$, $d_{\text{in}}=d_{\text{out}}=100$, $N=500$, $\rho=1$, $\phi=\text{erf}$.
Figure 3: Dependence of (a) kernels $K^{(l)}$ and (b) the respective response function $\chi^{(l)}$ on the residual scaling parameter $\rho$. The residual scaling takes values $\rho\in[1.0,\,0.3,\,0.1]$ (from dark to light). The residual scaling parameter $\rho$ governs the rate of increase in both quantities. Other parameters: $\sigma_{w,\,\text{in}}^{2}=\sigma_{w}^{2}=\sigma_{w,\,\text{out}}^{2}=1.2$, $\sigma_{b,\,\text{in}}^{2}=\sigma_{b}^{2}=\sigma_{b,\,\text{out}}^{2}=0.2$, $d_{\text{in}}=d_{\text{out}}=100$, $N=500$, $\phi=\text{erf}$.
Figure 4: Optimal scaling of the residual branch. Output response $\chi^{\text{out}}$ for (a) diagonal and (b) off-diagonal elements of the network kernel $K_{\alpha\beta}^{(l)}$. Different curves correspond to different network depths $L\in[10,\,50,\,100,\,200]$ (light to dark). All curves exhibit a unique maximum; the residual scaling values $\rho^{\ast}$ with largest response concentrate with increasing depth. (c) Optimal residual scaling $\rho^{\ast}=\mathrm{argmax}(\chi^{\text{out}})$ for diagonal (blue) and off-diagonal (green) elements of the network kernel $K_{\alpha\beta}^{(l)}$. In both cases, these scale with $1/\sqrt{L}$ (gray). Other parameters: input kernel $K^{(0)}=\begin{pmatrix}0.05 & 0.03\\ 0.03 & 0.05\end{pmatrix}$, $\sigma_{w}^{2}=1.25$, $\sigma_{b}^{2}=0.05$, $d_{\text{in}}=d_{\text{out}}=100$, $N=500$, $\phi=\text{erf}$.
Figure 5: Optimal scalings depend strongly on network depth but weakly on other hyperparameters. We illustrate the weak dependence on the weight variance $\sigma_{w}^{2}$ and bias variance $\sigma_{b}^{2}$ relative to the network depth $L$ for CIFAR-10 for both (a) variances and (b) covariances; samples are either dogs or airplanes. We measure the scaling with maximal output response averaged over all diagonal or all off-diagonal elements of the covariance, $\bar{\rho}_{\alpha\alpha}^{\ast}=\frac{1}{N}\sum_{\alpha}\mathrm{argmax}(\chi_{\alpha\alpha}^{\text{out}})$ or $\bar{\rho}_{\alpha\beta}^{\ast}=\frac{1}{N(N-1)}\sum_{\alpha\neq\beta}\mathrm{argmax}(\chi_{\alpha\beta}^{\text{out}})$. Other parameters: data set size $P=20$, input scale $K^{(0)}=0.05$, $\sigma_{w}^{2}=1.25$, $\sigma_{b}^{2}=0.05$, $d_{\text{in}}=d_{\text{out}}=100$, $N=500$, $\phi=\text{erf}$.
...and 4 more figures
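The qualitative behavior described in the captions of Figures 1 and 3, a signal variance that can only grow across layers at a rate set by $\rho$, can be reproduced with a short sketch. It assumes the simplified infinite-width recursion $K^{(l+1)}=K^{(l)}+\rho^{2}\left(\sigma_{w}^{2}\,\tfrac{2}{\pi}\arcsin\tfrac{2K^{(l)}}{1+2K^{(l)}}+\sigma_{b}^{2}\right)$ for $\phi=\text{erf}$. The recursion, the function name `kernel_trajectory`, and the initial value $K^{(0)}=0.05$ are illustrative assumptions; the variances and the $\rho$ values are taken from the Figure 3 caption.

```python
import numpy as np


def kernel_trajectory(rho, depth, k0=0.05, sigma_w2=1.2, sigma_b2=0.2):
    """Variance K^(l) for l = 0..depth under the ASSUMED erf recursion."""
    ks = np.empty(depth + 1)
    ks[0] = k0
    for layer in range(depth):
        k = ks[layer]
        # Closed-form E[erf(h)^2] for h ~ N(0, k).
        second_moment = (2.0 / np.pi) * np.arcsin(2.0 * k / (1.0 + 2.0 * k))
        # Skip connection carries k forward; the residual branch adds to it.
        ks[layer + 1] = k + rho**2 * (sigma_w2 * second_moment + sigma_b2)
    return ks


if __name__ == "__main__":
    for rho in (1.0, 0.3, 0.1):
        trajectory = kernel_trajectory(rho, depth=20)
        print(rho, trajectory[[0, 5, 10, 20]])
```

Because each layer adds a non-negative increment proportional to $\rho^{2}$, the trajectories are monotonically increasing and ordered by $\rho$, which mirrors the statement that the residual scaling governs the rate of increase of the kernel.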
