Bayesian Inference with Deep Weakly Nonlinear Networks

Boris Hanin; Alexander Zlokapa

TL;DR

The paper analyzes Bayesian inference with deep fully connected networks that use the shaped activation $\phi(t)=t+\frac{\psi}{3L}t^3$, in a regime where the training set size $P$, input dimension $N_0$, layer widths $N$, and depth $L$ are all large with $P<N_0$. At leading order in width, posterior inference reduces to a kernel method with a $\psi$-dependent feature map $x_\psi$. The first-order $1/N$ corrections are data-dependent, cubic feature-learning terms controlled by the emergent ratio $LP/N$, which acts as an effective posterior depth. The analysis rests on a combinatorial model for prior moments, expressed via random graphs and a self-loop process, that enables perturbative expansions of the partition function $Z_\beta(x;\tau)$ and the predictive posterior. In the zero-temperature limit, the authors give conditions under which depth improves model evidence and generalization for certain data spectra (e.g., power laws $\lambda_j \sim j^{-\alpha}$), and they discuss benign overfitting in deep linear versus nonlinear networks; depth is increasingly beneficial when $\alpha<2$ and the data are well aligned. Overall, the work connects kernel methods with data-dependent learning in deep nonlinear networks and provides a controlled framework for quantifying how depth and nonlinearity influence Bayesian inference in high-dimensional, overparameterized regimes.

Abstract

We show at a physics level of rigor that Bayesian inference with a fully connected neural network and a shaped nonlinearity of the form $\phi(t) = t + \psi t^3/L$ is (perturbatively) solvable in the regime where the number of training datapoints $P$, the input dimension $N_0$, the network layer widths $N$, and the network depth $L$ are simultaneously large. Our results hold with weak assumptions on the data; the main constraint is that $P < N_0$. We provide techniques to compute the model evidence and posterior to arbitrary order in $1/N$ and at arbitrary temperature. We report the following results from the first-order computation:
1. When the width $N$ is much larger than the depth $L$ and training set size $P$, neural network Bayesian inference coincides with Bayesian inference using a kernel. The value of $\psi$ determines the curvature of a sphere, hyperbola, or plane into which the training data is implicitly embedded under the feature map.
2. When $LP/N$ is a small constant, neural network Bayesian inference departs from the kernel regime. At zero temperature, neural network Bayesian inference is equivalent to Bayesian inference using a data-dependent kernel, and $LP/N$ serves as an effective depth that controls the extent of feature learning.
3. In the restricted case of deep linear networks ($\psi=0$) and noisy data, we show a simple data model for which evidence and generalization error are optimal at zero temperature. As $LP/N$ increases, both evidence and generalization further improve, demonstrating the benefit of depth in benign overfitting.
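
To make the setup in the abstract concrete, the following is a minimal NumPy sketch of the shaped nonlinearity, a forward pass through a shaped fully connected network, and the ratio $LP/N$. It uses the abstract's form $\phi(t) = t + \psi t^3/L$ (the TL;DR writes the cubic coefficient as $\psi/(3L)$). The function names, the layer layout, and the $1/\text{fan-in}$ weight variance are illustrative assumptions, not details taken from this page.

```python
import numpy as np


def shaped_phi(t, psi, depth):
    """Shaped activation phi(t) = t + psi * t**3 / depth (form quoted in the abstract)."""
    return t + psi * t**3 / depth


def sample_weights(n_in, width, depth, rng):
    """Draw weights from a Gaussian prior with variance 1/fan_in (assumed scaling)."""
    dims = [n_in] + [width] * depth + [1]
    return [rng.normal(0.0, 1.0 / np.sqrt(m), size=(n, m))
            for m, n in zip(dims[:-1], dims[1:])]


def shaped_mlp_forward(x, weights, psi):
    """Forward pass of an assumed architecture: z -> W @ phi(z) at every layer.

    The first matrix acts directly on the input; each later matrix acts on
    phi of the previous pre-activation; the last matrix is a scalar readout.
    """
    depth = len(weights) - 1  # number of nonlinear layers, L
    z = weights[0] @ x
    for w in weights[1:]:
        z = w @ shaped_phi(z, psi, depth)
    return z


def effective_depth(depth, n_train, width):
    """The ratio L * P / N that the abstract identifies as an effective depth."""
    return depth * n_train / width
```

With `psi = 0.0` the forward pass reduces to a product of the weight matrices, which is the deep linear case singled out in the abstract's third result.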

Paper Structure (33 sections, 15 theorems, 237 equations, 3 figures)

Introduction
Informal Overview of Results
Statement of Results and Relation to Prior Work
Definitions
Model: Shaped Neural Networks
Data
Prior, Likelihood, Posterior
Results
Kernel Regime for Shaped Networks
Perturbation Theory Around Kernel Regime
Combinatorial Model for Prior Moments
Review of Literature
Limitations
Computing the Prior: Graphical Model
Derivation of Proposition \ref{prop:M-rep}
...and 18 more sections

Key Result

Proposition 2.2

Fix $\overline{\mu}, \overline{\nu}$ and consider the random graph process $G^{(\ell)}=\left(V_{\overline{\mu},\overline{\nu}},E^{(\ell)}\right)$ from Definition 2.1. Define […]. We then have […], where $\mathrm{sgn}\left(\cdot\right)$ is the sign function, for any $\ell=0,\ldots, L$ we define the edge-weights by […], and the expectation is over the random graph process.

Figures (3)

Figure 1: A geometric description of the kernel method that is equivalent to Bayesian inference with shaped MLPs in the regime where $N\gg L,P$. In the figure, we fix the strength $\psi$ of the nonlinearity; varying $\psi$ corresponds to varying the curvature of the hyperbola ($\psi > 0$) or sphere ($\psi < 0$). For simplicity of notation, we write $x$ in the figure for $x/\sqrt{N_0}$.
Figure 2: Phase diagram of the log evidence of a deep nonlinear network at zero temperature. The dataset covariance matrix has a power law spectrum $\lambda_j \sim j^{-\alpha}$ (\ref{eq:power-law}) and the label vector lies in the $k$th direction (\ref{eq:power-label}) for $k = P^\gamma$. The first-order expansion in $1/N$ is perturbatively valid for $\gamma < 1/\alpha$ and $\alpha < 2$. Within the perturbative regime, depth improves the evidence; at the two boundaries of the regime, depth either increases or decreases the evidence. See Figure 3 for the phase diagram at nonzero temperature.
Figure 3: Phase diagram of the leading-order log evidence of a deep nonlinear network at temperature $B:=\beta P = P^\delta$ (neglecting positive constants). The dataset covariance matrix has a power law spectrum $\lambda_j \sim j^{-\alpha}$ (\ref{eq:power-law}) and the label vector lies in the $k$th direction (\ref{eq:power-label}) for $k = P^\gamma$. When $\alpha < \delta$, the network is in the zero-temperature regime shown in Figure 2. We only show here solutions for $\delta > 1$; when $\delta < 1$, the log evidence is $-P \log P$ to leading order for all $\alpha, \gamma$.
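
The data model behind the two phase diagrams can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' code: the proportionality constant in $\lambda_j \sim j^{-\alpha}$ is set to 1, the label is taken to be a unit vector along the $k$th eigendirection with $k$ rounded to the nearest integer, and all function names are hypothetical. The regime check encodes only the condition quoted in the Figure 2 caption.

```python
import numpy as np


def power_law_spectrum(n_train, alpha):
    """Eigenvalues lambda_j = j**(-alpha) for j = 1..P (constant set to 1 by assumption)."""
    j = np.arange(1, n_train + 1)
    return j ** (-float(alpha))


def label_direction(n_train, gamma):
    """Unit label vector along the k-th eigendirection, with k = round(P**gamma).

    Rounding and clipping k into 1..P are choices made for this sketch.
    """
    k = int(round(n_train ** gamma))
    k = min(max(k, 1), n_train)
    y = np.zeros(n_train)
    y[k - 1] = 1.0
    return k, y


def in_perturbative_regime(alpha, gamma):
    """Condition from the Figure 2 caption: gamma < 1/alpha and alpha < 2."""
    return gamma < 1.0 / alpha and alpha < 2.0
```

For example, `label_direction(100, 0.5)` places the label along the 10th eigendirection, and `in_perturbative_regime(1.5, 0.5)` holds because $0.5 < 1/1.5$ and $1.5 < 2$.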

Theorems & Definitions (23)

Definition 2.1: Random Graph Process
Proposition 2.2
Proposition 3.1
Corollary 3.2
Lemma 3.3
proof
Proposition 4.1
Corollary 4.2
Proposition 5.1
proof
...and 13 more
