A Gaussian Process View on Observation Noise and Initialization in Wide Neural Networks

Sergio Calvo-Ordoñez, Jonathan Plenk, Richard Bergna, Alvaro Cartea, Jose Miguel Hernandez-Lobato, Konstantina Palla, Kamil Ciosek

TL;DR

This work makes the Neural Tangent Kernel (NTK) framework practical for Gaussian Process (GP) inference on real data. It introduces a weight-space regularizer that corresponds to observation (aleatoric) noise in the NTK-GP and proves that, in wide networks, the training dynamics remain effectively linear, yielding a closed-form posterior mean with nonzero noise: $f^{\text{lin}}_{\theta_0}(x') = f(x',\theta_0) + \mathbf{\Theta}_{x',x}(\mathbf{\Theta}_{x,x} + \beta I)^{-1}(\mathbf{y} - f(x,\theta_0))$, where $x$ denotes the training inputs, $x'$ a test point, and $\beta$ plays the role of the observation-noise variance. To accommodate arbitrary prior means, it proposes a shifted-network construction, $\widetilde{f}_{\theta_0}(x,\theta) = f(x,\theta) - f(x,\theta_0) + m(x)$, which in the infinite-width limit yields the posterior mean $m(x') + \mathbf{\Theta}_{x',x}(\mathbf{\Theta}_{x,x} + \beta I)^{-1}(\mathbf{y} - m(x))$. A single training run, without ensembling or kernel inversion, suffices to obtain the NTK-GP posterior mean with prior mean $m$. Experiments show convergence of the trained network to the linearized model and improved data efficiency when a learned prior mean is used. The result is practical NTK-GP inference on noisy data with arbitrary prior means, which may broaden the applicability of NTK-based Bayesian reasoning in real-world settings.
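The closed-form posterior mean above can be evaluated directly once a kernel matrix is available. The following is a minimal sketch rather than the authors' implementation: the function name `ntk_gp_posterior_mean`, the RBF kernel used as a stand-in for the NTK, the grid of inputs, and the linear prior mean are all assumptions made for illustration.

```python
import numpy as np

def ntk_gp_posterior_mean(k_test_train, k_train_train, y, m_train, m_test, beta):
    """m(x') + Theta_{x',x} (Theta_{x,x} + beta I)^{-1} (y - m(x))."""
    n = k_train_train.shape[0]
    # Solve the linear system instead of forming an explicit inverse.
    alpha = np.linalg.solve(k_train_train + beta * np.eye(n), y - m_train)
    return m_test + k_test_train @ alpha

def rbf(a, b, lengthscale=0.5):
    # Stand-in kernel (assumption): any positive semi-definite kernel,
    # including an NTK, could be plugged in here instead.
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dist / lengthscale**2)

def prior_mean(z):
    # Hypothetical prior mean m(x), chosen only for illustration.
    return 0.3 * z[:, 0]

x_train = np.linspace(-3.0, 3.0, 6)[:, None]
y_train = np.sin(x_train[:, 0])
K = rbf(x_train, x_train)
m_train = prior_mean(x_train)

# Evaluate at the training inputs under two extreme noise levels.
nearly_noiseless = ntk_gp_posterior_mean(K, K, y_train, m_train, m_train, beta=1e-8)
heavy_noise = ntk_gp_posterior_mean(K, K, y_train, m_train, m_train, beta=1e8)
```

With $\beta$ close to zero the posterior mean interpolates the targets, while a very large $\beta$ leaves the prior mean essentially unchanged, which is the sense in which $\beta$ behaves like an observation-noise variance.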

Abstract

Performing gradient descent in a wide neural network is equivalent to computing the posterior mean of a Gaussian Process with the Neural Tangent Kernel (NTK-GP), for a specific prior mean and with zero observation noise. However, existing formulations have two limitations: (i) the NTK-GP assumes noiseless targets, leading to misspecification on noisy data; (ii) the equivalence does not extend to arbitrary prior means, which are essential for well-specified models. To address (i), we introduce a regularizer into the training objective, showing its correspondence to incorporating observation noise in the NTK-GP. To address (ii), we propose a shifted network that enables arbitrary prior means and allows obtaining the posterior mean with gradient descent on a single network, without ensembling or kernel inversion. We validate our results with experiments across datasets and architectures, showing that this approach removes key obstacles to the practical use of NTK-GP equivalence in applied Gaussian process modeling.
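The correspondence between regularized training of a shifted network and the noisy GP posterior mean can be checked numerically in a setting where the model is exactly linear in its parameters. The sketch below is illustrative only and rests on several assumptions that do not come from the paper: a fixed random-feature model stands in for a wide network, the regularizer is assumed to take the form $\frac{\beta}{2}\|\theta - \theta_0\|^2$, and the data and prior mean are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, p, beta = 20, 3, 512, 0.1

X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)   # noisy synthetic targets
X_test = rng.normal(size=(5, d))

# Hypothetical stand-in for a wide network: fixed random tanh features,
# so f(x, theta) = phi(x) @ theta is exactly linear in theta and its
# tangent kernel is phi(x) @ phi(x').
W = rng.normal(size=(d, p))

def phi(x):
    return np.tanh(x @ W) / np.sqrt(p)

def prior_mean(x):
    # Arbitrary illustrative prior mean m(x).
    return 0.5 * x[:, 1]

theta0 = rng.normal(size=p)

def shifted_f(x, theta):
    # Shifted network: f(x, theta) - f(x, theta0) + m(x).
    return phi(x) @ (theta - theta0) + prior_mean(x)

Phi = phi(X)
K = Phi @ Phi.T                                   # kernel on training inputs
lr = 1.0 / (np.linalg.eigvalsh(K).max() + beta)   # stable step size

# Gradient descent on 0.5 * ||f~(X, theta) - y||^2 + 0.5 * beta * ||theta - theta0||^2
# (assumed form of the regularized objective).
theta = theta0.copy()
for _ in range(20000):
    residual = shifted_f(X, theta) - y
    grad = Phi.T @ residual + beta * (theta - theta0)
    theta -= lr * grad

gd_pred = shifted_f(X_test, theta)

# Closed form, computed here only to verify the gradient-descent result.
K_test = phi(X_test) @ Phi.T
closed_form = prior_mean(X_test) + K_test @ np.linalg.solve(
    K + beta * np.eye(n), y - prior_mean(X)
)
max_gap = np.abs(gd_pred - closed_form).max()
```

In this linear toy model the two predictions agree up to optimization error, so the kernel solve is needed only for the check and not for prediction. For a finite-width nonlinear network the agreement would be approximate rather than exact.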

Paper Structure

This paper contains 39 sections, 14 theorems, 118 equations, 7 figures.

Key Result

Theorem 4.1

For training time $t\to\infty$, at any point $\mathbf{x'}$,

Figures (7)

  • Figure 1: Frobenius norm differences between the trained neural network's parameters and the kernel ridge regression solution plotted against network width for different datasets. Left: Different input dimensions of the synthetic dataset. Middle: Airline and Taxi UCI datasets. Right: Year UCI dataset. Shaded regions represent the standard deviation divided by the square root of the number of seeds.
  • Figure 2: Supremum of the $\ell_2$ norm between the trained neural network's outputs $f(x, \theta_\infty)$ and the linearized model's predictions $f^{\text{lin}}(x, \theta_\infty^{\text{lin}})$ across the validation set, plotted against network width. Left: Different input dimensions of the synthetic dataset. Middle: Airline and Taxi UCI datasets. Right: Year UCI dataset.
  • Figure 3: Test MSE on Task 2 as a function of the number of training points, comparing training from scratch (Vanilla) with using the posterior from Task 1 as a prior mean (Pre-training). Pre-training significantly improves performance in low-data regimes.
  • Figure 4: Parameter and function differences for additional network depths. (Top row) Parameter difference plots from Section 5.2. (Bottom row) Function difference plots from Section 5.3. (Left) Results for one fully connected hidden layer. (Right) Results for an MLP with three fully connected hidden layers. In all cases, increasing the network width reduces both parameter and function differences, confirming the theoretical predictions. $\beta = 0.1$ was used.
  • Figure 5: $\ell_2$ norm differences between the trained neural network's parameters and the kernel ridge regression solution plotted against network width for different $\beta$. Shaded regions represent the standard deviation divided by the square root of the number of seeds.
  • ...and 2 more figures

Theorems & Definitions (22)

  • Theorem 4.1
  • Theorem 4.2
  • Theorem 4.3
  • Theorem 5.1
  • Lemma A.0
  • Theorem E.1
  • Lemma E.2
  • Lemma F.1
  • proof
  • Theorem F.1
  • ...and 12 more