A Gaussian Process View on Observation Noise and Initialization in Wide Neural Networks
Sergio Calvo-Ordoñez, Jonathan Plenk, Richard Bergna, Alvaro Cartea, Jose Miguel Hernandez-Lobato, Konstantina Palla, Kamil Ciosek
TL;DR
This work extends the Neural Tangent Kernel (NTK) framework so that it is practically usable for Gaussian Process (GP) inference on real data. It introduces a weight-space regularizer that injects aleatoric noise into the NTK-GP posterior mean and proves that, in wide networks, the training dynamics remain effectively linear, yielding a closed-form posterior mean with a nonzero noise term $\beta I$: $f^{\text{lin}}_{\theta_0}(x') = f(x',\theta_0) + \mathbf{\Theta}_{x',x}(\mathbf{\Theta}_{x,x} + \beta I)^{-1}(\mathbf{y} - f(x,\theta_0))$, where $x$ and $\mathbf{y}$ are the training inputs and targets, $x'$ is a test input, and $\theta_0$ denotes the parameters at initialization. To accommodate an arbitrary prior mean $m$, it proposes a shifted-network construction: $\widetilde{f}_{\theta_0}(x,\theta) = f(x,\theta) - f(x,\theta_0) + m(x)$, which, in the infinite-width limit, yields the posterior mean $m(x') + \mathbf{\Theta}_{x',x}(\mathbf{\Theta}_{x,x} + \beta I)^{-1}(\mathbf{y} - m(x))$. A single training run, without ensembling or kernel inversion, suffices to obtain the NTK-GP posterior mean with prior mean $m$. Experiments validate the method, showing convergence to the linearized model and improved data efficiency when using a learned prior. The work thus enables practical NTK-GP inference on noisy data with arbitrary prior means, potentially broadening the applicability of NTK-based Bayesian reasoning in real-world settings.
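The first closed form above can be checked numerically in a setting where linearization is exact. The sketch below is an illustration under stated assumptions, not the authors' implementation: it uses a hypothetical model that is linear in its parameters (so its NTK is simply the inner product of feature vectors), and it assumes the regularized objective has the form $\tfrac{1}{2}\lVert f(x,\theta) - \mathbf{y}\rVert^2 + \tfrac{\beta}{2}\lVert \theta - \theta_0\rVert^2$; the exact scaling of the regularizer in the paper may differ. The names `features`, `gd_pred`, and `closed_pred` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def features(x):
    # Hypothetical feature map. Because f is linear in theta, its linearization
    # around theta0 is exact and its NTK is features(x) @ features(x').T.
    return np.stack([np.ones_like(x), x, np.sin(2.0 * x), np.cos(2.0 * x)], axis=1)

def f(x, theta):
    return features(x) @ theta

# Toy noisy regression data (illustrative only).
x_train = np.linspace(-2.0, 2.0, 8)
y_train = np.sin(x_train) + 0.1 * rng.standard_normal(x_train.shape)
x_test = np.linspace(-2.5, 2.5, 5)

beta = 0.5                       # plays the role of the noise term beta * I
theta0 = rng.standard_normal(4)  # random initialization

# Gradient descent on the ASSUMED objective
#   0.5 * ||f(X, theta) - y||^2 + 0.5 * beta * ||theta - theta0||^2
J = features(x_train)            # Jacobian of f w.r.t. theta on the training set
lr = 1.0 / (np.linalg.eigvalsh(J.T @ J).max() + beta)  # step size 1 / (largest curvature)
theta = theta0.copy()
for _ in range(5000):
    grad = J.T @ (J @ theta - y_train) + beta * (theta - theta0)
    theta -= lr * grad
gd_pred = f(x_test, theta)

# Closed-form posterior mean with prior mean f(., theta0) and noise term beta * I.
K_xx = J @ J.T                       # Theta_{x,x}
K_tx = features(x_test) @ J.T        # Theta_{x',x}
residual = y_train - f(x_train, theta0)
closed_pred = f(x_test, theta0) + K_tx @ np.linalg.solve(
    K_xx + beta * np.eye(len(x_train)), residual
)

print("max |GD - closed form|:", np.max(np.abs(gd_pred - closed_pred)))
```

In this toy setting the two predictions agree up to optimization and round-off error, which mirrors the claim that the weight-space penalty acts like observation noise in the kernel formula. For a real finite-width network the agreement would only be approximate.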
Abstract
Performing gradient descent in a wide neural network is equivalent to computing the posterior mean of a Gaussian Process with the Neural Tangent Kernel (NTK-GP), for a specific prior mean and with zero observation noise. However, existing formulations have two limitations: (i) the NTK-GP assumes noiseless targets, leading to misspecification on noisy data; (ii) the equivalence does not extend to arbitrary prior means, which are essential for well-specified models. To address (i), we introduce a regularizer into the training objective, showing its correspondence to incorporating observation noise in the NTK-GP. To address (ii), we propose a "shifted network" that enables arbitrary prior means and allows obtaining the posterior mean with gradient descent on a single network, without ensembling or kernel inversion. We validate our results with experiments across datasets and architectures, showing that this approach removes key obstacles to the practical use of NTK-GP equivalence in applied Gaussian process modeling.
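As an illustration of the shifted-network idea in (ii), the following sketch wraps a base model so that its output at initialization equals a chosen prior mean, then trains it once by gradient descent. It rests on several assumptions that are not taken from the paper: the base model is a hypothetical random-feature model that is linear in its parameters (a stand-in for a wide network in the linear regime), the prior mean `prior_mean` is an arbitrary choice, and the regularizer is assumed to be $\tfrac{\beta}{2}\lVert \theta - \theta_0\rVert^2$. The helper names `make_shifted` and `train_shifted` are invented for this example.

```python
import numpy as np

rng = np.random.default_rng(1)
width = 256
W = rng.standard_normal(width)
b = rng.uniform(0.0, 2.0 * np.pi, width)

def features(x):
    # Fixed random cosine features; f is linear in theta, so the NTK is
    # features(x) @ features(x').T (an assumption made to keep the check exact).
    return np.sqrt(2.0 / width) * np.cos(np.outer(x, W) + b)

def f(x, theta):
    return features(x) @ theta

def prior_mean(x):
    # Hypothetical prior mean m(x); any fixed function could be used here.
    return 0.5 * x

def make_shifted(f, theta0, m):
    # Shifted model: f(x, theta) - f(x, theta0) + m(x); equals m(x) at theta = theta0.
    def f_shifted(x, theta):
        return f(x, theta) - f(x, theta0) + m(x)
    return f_shifted

x_train = np.linspace(-2.0, 2.0, 10)
y_train = 0.5 * x_train + np.sin(3.0 * x_train) + 0.1 * rng.standard_normal(10)
x_test = np.linspace(-3.0, 3.0, 7)
beta = 0.1

def train_shifted(theta0, steps=5000):
    f_shifted = make_shifted(f, theta0, prior_mean)
    J = features(x_train)  # Jacobian w.r.t. theta; the shift terms do not depend on theta
    lr = 1.0 / (np.linalg.eigvalsh(J @ J.T).max() + beta)
    theta = theta0.copy()
    for _ in range(steps):
        # Gradient of 0.5*||f_shifted(X, theta) - y||^2 + 0.5*beta*||theta - theta0||^2
        grad = J.T @ (f_shifted(x_train, theta) - y_train) + beta * (theta - theta0)
        theta -= lr * grad
    return f_shifted(x_test, theta)

# Two independent initializations, each trained once.
pred_a = train_shifted(rng.standard_normal(width))
pred_b = train_shifted(rng.standard_normal(width))

# Kernel formula with prior mean m and noise term beta * I, for comparison only.
J = features(x_train)
K_xx, K_tx = J @ J.T, features(x_test) @ J.T
closed_pred = prior_mean(x_test) + K_tx @ np.linalg.solve(
    K_xx + beta * np.eye(len(x_train)), y_train - prior_mean(x_train)
)

print("gap to closed form:", np.max(np.abs(pred_a - closed_pred)))
print("gap between inits: ", np.max(np.abs(pred_a - pred_b)))
```

In this linear stand-in, both runs land on the same prediction regardless of the random initialization, and that prediction matches the kernel formula with prior mean `prior_mean`. The kernel solve appears only as a reference; the training loop itself never inverts a kernel matrix, which is the practical point of the construction.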
