Beyond NNGP: Large Deviations and Feature Learning in Bayesian Neural Networks

Katerina Papagiannouli; Dario Trevisan; Giuseppe Pio Zitto

TL;DR

Large-deviation theory provides explicit variational objectives (rate functions) on predictors, yielding an emerging notion of complexity and feature learning directly at the functional level, in contrast with fixed-kernel (NNGP) theory.

Abstract

We study wide Bayesian neural networks, focusing on the rare but statistically dominant fluctuations that govern posterior concentration, beyond Gaussian-process limits. Large-deviation theory provides explicit variational objectives (rate functions) on predictors, yielding an emerging notion of complexity and feature learning directly at the functional level. We show that the posterior output rate function is obtained by a joint optimization over predictors and internal kernels, in contrast with fixed-kernel (NNGP) theory. Numerical experiments demonstrate that the resulting predictions accurately describe finite-width behavior for moderately sized networks, capturing non-Gaussian tails, posterior deformation, and data-dependent kernel selection effects.

Paper Structure (32 sections, 6 theorems, 54 equations, 12 figures)

Introduction
A variational perspective on Bayesian learning
Learning-theoretic interpretation and connections
Gaussian Processes and Gaussian Neural Networks
GP case -- Fixed Kernels.
Wide Gaussian Neural Networks -- Kernel Selection.
Numerical Experiments
01A: Prior rate function.
01B: Posterior rate function.
01C: MAP prediction (mode) as a function of the input.
02A: Prior rate --- LDP versus NNGP quadratic rate
02C: Predictive curve --- LDP-MAP versus NNGP posterior mean
Experiment 03: finite-width validation via Monte Carlo
03A: prior tails --- empirical decay vs LDP rate
03B: Posterior samples --- MC vs LDP and NNGP.
...and 17 more sections

Key Result

Proposition 4.1

Fix $\mathcal{X}$ and a depth $L$, and consider the Gaussian network defined by the recursion (eq:nn-recursion-short) with hidden widths $n$. For each $\ell=1,\dots,L-1$, the sequence $(K_n^{(\ell)}(\mathcal{X}))_{n\ge1}$ satisfies a large-deviation principle whose rate function is expressed through the layer cost $J^{\sigma^{(\ell)}}(\cdot \mid \kappa^{(\ell-1)})$, given as a Legendre--Fenchel transform of the conditional log-MGF of $K\ldots$
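The Legendre--Fenchel transform mentioned in the proposition can be illustrated on a toy case. The sketch below is not the paper's layer cost: it assumes a scalar centered Gaussian with variance `k` (a hypothetical stand-in for a fixed kernel value), whose log-MGF is $k t^2/2$, and checks numerically that its transform is the quadratic rate $y^2/(2k)$, the kind of rate a fixed-kernel (NNGP) description induces. The function names, the grid, and the value of `k` are all illustrative assumptions.

```python
import numpy as np

def legendre_fenchel(log_mgf, y, t_grid):
    """Approximate I(y) = sup_t [t*y - log_mgf(t)] by maximizing over a grid of t."""
    y = np.atleast_1d(np.asarray(y, dtype=float))
    # Rows index the points y, columns index the dual variable t.
    values = y[:, None] * t_grid[None, :] - log_mgf(t_grid)[None, :]
    return values.max(axis=1)

def gaussian_log_mgf(k):
    """log-MGF of a centered Gaussian with variance k: Lambda(t) = k * t^2 / 2."""
    return lambda t: 0.5 * k * t**2

k = 2.0                                      # hypothetical kernel value (assumption)
t_grid = np.linspace(-10.0, 10.0, 20001)     # dual grid; must contain the maximizer y / k
y = np.array([-1.0, 0.0, 0.5, 2.0])

rate_numeric = legendre_fenchel(gaussian_log_mgf(k), y, t_grid)
rate_closed = y**2 / (2.0 * k)               # known closed form for the Gaussian case

print(np.max(np.abs(rate_numeric - rate_closed)))
```

The grid search is only reliable when the maximizer $t = y/k$ lies inside `t_grid`; for a non-Gaussian log-MGF the same routine applies, but the resulting rate is in general no longer quadratic.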

Figures (12)

Figure 1: Prior output large-deviation rate functions. Prior rate $I_{\mathrm{prior}}(y)$ as a function of the output $y$ for a wide Gaussian neural network with ReLU (left) and $\tanh$ (right) activation.
Figure 2: Posterior deformation of the output large-deviation rate function. Prior and posterior rate functions $I_{\mathrm{prior}}(y)$ and $I_{\mathrm{post}}(y)$ at a fixed test input $x_{\mathrm{test}}=3$, for a wide Gaussian neural network (left, ReLU activation; right, $\tanh$ activation) trained on a Heaviside target
Figure 3: Posterior MAP prediction curves. Large-deviation MAP prediction $y^\ast(x_{\mathrm{test}})$ as a function of the test input $x_{\mathrm{test}}$, for a wide Gaussian neural network trained on a Heaviside target. Left: ReLU activation. Right: $\tanh$ activation.
Figure 4: Prior LDP versus NNGP. Left: prior large-deviation rate function compared with the quadratic rate induced by the NNGP kernel. Right: relative operator-norm gap between the kernel selected by the LDP variational problem and the NNGP kernel.
Figure 5: Posterior LDP versus NNGP. Left: posterior large-deviation rate function compared with the quadratic posterior rate induced by Gaussian-process regression with the NNGP kernel. Right: relative operator-norm gap between the kernel selected by the posterior LDP variational problem and the NNGP kernel.
...and 7 more figures
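The relative operator-norm gap used in the right panels of Figures 4 and 5 can be sketched as follows. The kernel selected by the LDP variational problem is not available in this summary, so the example compares a different pair: the infinite-width ReLU kernel (the standard order-1 arc-cosine formula) against the empirical kernel of a random one-hidden-layer network of finite width. The unit-variance weights, the absence of biases, the inputs `X`, and all function names are assumptions made for illustration.

```python
import numpy as np

def relu_nngp_kernel(X):
    """E[relu(w.x) relu(w.x')] for w ~ N(0, I): order-1 arc-cosine kernel."""
    norms = np.linalg.norm(X, axis=1)
    cos = np.clip((X @ X.T) / np.outer(norms, norms), -1.0, 1.0)
    theta = np.arccos(cos)
    return np.outer(norms, norms) * (np.sin(theta) + (np.pi - theta) * cos) / (2.0 * np.pi)

def empirical_relu_kernel(X, width, rng):
    """Finite-width kernel (1/width) * sum_j relu(w_j.x) relu(w_j.x')."""
    W = rng.standard_normal((width, X.shape[1]))
    H = np.maximum(W @ X.T, 0.0)             # shape (width, num_inputs)
    return H.T @ H / width

def relative_operator_gap(K, K_ref):
    """Spectral norm of K - K_ref, relative to the spectral norm of K_ref."""
    return np.linalg.norm(K - K_ref, ord=2) / np.linalg.norm(K_ref, ord=2)

rng = np.random.default_rng(0)
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # hypothetical inputs
K_inf = relu_nngp_kernel(X)

widths = [100, 50000]
gaps = [relative_operator_gap(empirical_relu_kernel(X, n, rng), K_inf) for n in widths]
for n, g in zip(widths, gaps):
    print(f"width={n:6d}  relative operator-norm gap={g:.4f}")
```

For this stand-in pair the gap shrinks as the width grows, since the empirical kernel concentrates on the infinite-width one. In the figures the gap is measured for the kernel selected by the variational problem, which the abstract describes as data-dependent.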

Theorems & Definitions (11)

Proposition 4.1: Prior LDP for layerwise kernels; NNGP as unique minimizer
Theorem 4.2: Prior LDP for outputs
Theorem 4.3: Posterior LDP for rescaled outputs under quadratic loss
Corollary 4.4: Posterior-optimal kernel differs from the NNGP kernel
Definition A.1: LDP convergence
Lemma B.1: Change of measure / Bayes rule for LDP
proof
proof: Proof of Theorem 4.3
Remark B.2
proof: Proof of Corollary 4.4
...and 1 more
