Beyond NNGP: Large Deviations and Feature Learning in Bayesian Neural Networks

Katerina Papagiannouli, Dario Trevisan, Giuseppe Pio Zitto

TL;DR

Large-deviation theory provides explicit variational objectives (rate functions) on predictors, yielding an emerging notion of complexity and feature learning directly at the functional level, in contrast with fixed-kernel (NNGP) theory.

Abstract

We study wide Bayesian neural networks, focusing on the rare but statistically dominant fluctuations that govern posterior concentration, beyond Gaussian-process limits. Large-deviation theory provides explicit variational objectives (rate functions) on predictors, yielding an emerging notion of complexity and feature learning directly at the functional level. We show that the posterior output rate function is obtained by a joint optimization over predictors and internal kernels, in contrast with fixed-kernel (NNGP) theory. Numerical experiments demonstrate that the resulting predictions accurately describe finite-width behavior for moderately sized networks, capturing non-Gaussian tails, posterior deformation, and data-dependent kernel selection effects.
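For orientation, the fixed-kernel (NNGP) baseline that the abstract contrasts against can be sketched in a few lines. This is a minimal illustration, not the paper's code: it assumes a bias-free ReLU network with weight variance `sigma_w2 = 2`, nonzero inputs, and a quadratic rate normalized as $\tfrac12 y^\top K^{-1} y$. The function names, the depth convention, and the toy data are all hypothetical.

```python
import numpy as np

def relu_nngp_kernel(X, n_hidden, sigma_w2=2.0):
    """NNGP kernel of a bias-free ReLU network with `n_hidden` ReLU layers.

    Assumption (not from the paper): weights have variance sigma_w2 / fan_in,
    and every row of X is nonzero.
    """
    # Kernel of the first-layer pre-activations.
    K = sigma_w2 * (X @ X.T) / X.shape[1]
    for _ in range(n_hidden):
        d = np.sqrt(np.diag(K))
        norm = np.outer(d, d)
        cos = np.clip(K / norm, -1.0, 1.0)
        theta = np.arccos(cos)
        # Closed form of E[relu(u) relu(v)] for centred Gaussian (u, v).
        K = sigma_w2 * norm * (np.sin(theta) + (np.pi - theta) * cos) / (2 * np.pi)
    return K

def quadratic_rate(K, y, jitter=1e-10):
    """Gaussian (fixed-kernel) rate 0.5 * y^T K^{-1} y, with a small jitter."""
    K = K + jitter * np.eye(len(y))
    return 0.5 * float(y @ np.linalg.solve(K, y))

# Toy inputs and outputs, chosen only for illustration.
X = np.array([[1.0, 0.0], [0.6, 0.8], [-1.0, 0.5]])
y = np.array([1.0, 0.0, -1.0])
K = relu_nngp_kernel(X, n_hidden=2)
rate = quadratic_rate(K, y)
```

In this fixed-kernel picture the kernel `K` depends only on the inputs, so the rate is always quadratic in `y`. The abstract's claim is that the large-deviation rate instead comes from optimizing jointly over predictors and internal kernels.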

Paper Structure

This paper contains 32 sections, 6 theorems, 54 equations, 12 figures.

Key Result

Proposition 4.1

Fix $\mathcal{X}$ and a depth $L$, and consider the Gaussian network (eq:nn-recursion-short) with hidden widths $n$. For each $\ell=1,\dots,L-1$, the sequence $(K_n^{(\ell)}(\mathcal{X}))_{n\ge1}$ satisfies a large-deviation principle with a rate function in which $J^{\sigma^{(\ell)}}(\cdot \mid \kappa^{(\ell-1)})$ is a layer cost, given as a Legendre--Fenchel transform of the conditional log-MGF of $K\dots$
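
As a reminder, the two standard notions the proposition relies on can be stated generically. The speed $n$ is assumed here to match the width scaling, and $Z_n$, $\Lambda$ are placeholder symbols rather than the paper's notation. Informally, a sequence $Z_n$ satisfies a large-deviation principle with rate function $I$ if

$$\mathbb{P}(Z_n \approx z) \asymp \exp\bigl(-n\, I(z)\bigr), \qquad n \to \infty,$$

and the Legendre--Fenchel transform of a log-moment generating function $\Lambda$ is

$$\Lambda^{*}(z) = \sup_{\lambda}\bigl\{\langle \lambda, z\rangle - \Lambda(\lambda)\bigr\}.$$

A rate function vanishes at the typical value of the sequence. On this reading, describing the NNGP kernel as the unique minimizer means it is the only kernel with zero prior cost.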

Figures (12)

  • Figure 1: Prior output large-deviation rate functions. Prior rate $I_{\mathrm{prior}}(y)$ as a function of the output $y$ for a wide Gaussian neural network with ReLU (left) and $\tanh$ (right) activation.
  • Figure 2: Posterior deformation of the output large-deviation rate function. Prior and posterior rate functions $I_{\mathrm{prior}}(y)$ and $I_{\mathrm{post}}(y)$ at a fixed test input $x_{\mathrm{test}}=3$, for a wide Gaussian neural network (left, ReLU activation; right, $\tanh$ activation) trained on a Heaviside target.
  • Figure 3: Posterior MAP prediction curves. Large-deviation MAP prediction $y^\ast(x_{\mathrm{test}})$ as a function of the test input $x_{\mathrm{test}}$, for a wide Gaussian neural network trained on a Heaviside target. Left: ReLU activation. Right: $\tanh$ activation.
  • Figure 4: Prior LDP versus NNGP. Left: prior large-deviation rate function compared with the quadratic rate induced by the NNGP kernel. Right: relative operator-norm gap between the kernel selected by the LDP variational problem and the NNGP kernel.
  • Figure 5: Posterior LDP versus NNGP. Left: posterior large-deviation rate function compared with the quadratic posterior rate induced by Gaussian-process regression with the NNGP kernel. Right: relative operator-norm gap between the kernel selected by the posterior LDP variational problem and the NNGP kernel.
  • ...and 7 more figures
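
The Gaussian-process regression baseline mentioned in the Figure 5 caption can be sketched as follows. This is a hedged illustration only: the one-hidden-layer arc-cosine kernel, the constant feature appended to the scalar input, the noise level, the training grid, and the normalization of the quadratic posterior rate are all assumptions made for the example, not the paper's experimental setup.

```python
import numpy as np

def arccos_kernel(A, B):
    """One-hidden-layer ReLU NNGP (arc-cosine) kernel between rows of A and B."""
    na = np.linalg.norm(A, axis=1)[:, None]
    nb = np.linalg.norm(B, axis=1)[None, :]
    cos = np.clip(A @ B.T / (na * nb), -1.0, 1.0)
    theta = np.arccos(cos)
    return na * nb * (np.sin(theta) + (np.pi - theta) * cos) / np.pi

def features(x):
    """Map scalar inputs to rows [x, 1]; the constant column is an assumption."""
    x = np.atleast_1d(x).astype(float)
    return np.stack([x, np.ones_like(x)], axis=1)

def gp_posterior(kernel, X, y, x_star, noise=1e-2):
    """Posterior mean and variance of GP regression at a single test point."""
    Kxx = kernel(X, X) + noise * np.eye(len(X))
    ks = kernel(X, x_star)          # shape (n, 1)
    kss = kernel(x_star, x_star)    # shape (1, 1)
    mean = ks.T @ np.linalg.solve(Kxx, y)
    var = kss - ks.T @ np.linalg.solve(Kxx, ks)
    return mean.item(), var.item()

# Heaviside target on a hypothetical training grid; test input x_test = 3.
x_train = np.linspace(-2.0, 2.0, 9)
y_train = (x_train > 0).astype(float)
m, v = gp_posterior(arccos_kernel, features(x_train), y_train, features(3.0))

def quadratic_posterior_rate(y):
    """Quadratic rate of the Gaussian posterior, up to an assumed normalization."""
    return 0.5 * (y - m) ** 2 / v
```

The rate is zero at the posterior mean `m` and grows quadratically away from it. The figures compare this fixed-kernel shape with the posterior large-deviation rate.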

Theorems & Definitions (11)

  • Proposition 4.1: Prior LDP for layerwise kernels; NNGP as unique minimizer
  • Theorem 4.2: Prior LDP for outputs
  • Theorem 4.3: Posterior LDP for rescaled outputs under quadratic loss
  • Corollary 4.4: Posterior-optimal kernel differs from the NNGP kernel
  • Definition A.1: LDP convergence
  • Lemma B.1: Change of measure / Bayes rule for LDP
  • proof
  • Proof of Theorem 4.3
  • Remark B.2
  • Proof of Corollary 4.4
  • ...and 1 more