Epistemic Uncertainty and Observation Noise with the Neural Tangent Kernel

Sergio Calvo-Ordoñez, Konstantina Palla, Kamil Ciosek

TL;DR

The paper addresses epistemic uncertainty in wide neural networks by extending the NTK-GP correspondence to settings with non-zero aleatoric noise. It derives estimators for the NTK-GP posterior mean under observation noise and for the posterior covariance, both computable via gradient-descent-based optimization. A data-shift trick yields a zero-mean prior, while a covariance estimator leverages a partial SVD of the Jacobian to decompose cross-kernel terms, enabling scalable uncertainty quantification. Empirical results on a toy regression task show that the proposed mean and covariance approximations closely track the analytic NTK-GP posterior, while remaining computationally efficient and compatible with standard training pipelines.
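
The data-shift idea mentioned in the summary can be illustrated on a model that is linear in its parameters, used here as a stand-in for a linearized wide network. The sketch below is not the paper's implementation: the feature matrices `J_train` and `J_test`, the noise variance `sigma2`, the loss scaling, and the helper `fit` are all assumptions made for illustration, and the paper's exact construction may differ. It shifts the targets by the model's initial predictions, solves the regularized least-squares problem in closed form, subtracts the initial predictions at test time, and compares the result with the posterior mean of a zero-mean GP with kernel `J J^T`.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical linear-in-parameters model f(x; theta) = J(x) @ theta.
n_train, n_test, p = 6, 4, 40
J_train = rng.normal(size=(n_train, p)) / np.sqrt(p)  # features at training inputs
J_test = rng.normal(size=(n_test, p)) / np.sqrt(p)    # features at test inputs
theta0 = rng.normal(size=p)                           # random initial parameters
y = rng.normal(size=n_train)                          # observed targets
sigma2 = 0.05                                         # assumed noise variance


def fit(targets):
    """Minimize ||targets - J theta||^2 + sigma2 * ||theta - theta0||^2 in closed form."""
    A = J_train.T @ J_train + sigma2 * np.eye(p)
    return np.linalg.solve(A, J_train.T @ targets + sigma2 * theta0)


# Shift the targets by the initial predictions, then remove the
# initial prediction again when evaluating at the test inputs.
y_shifted = y + J_train @ theta0
theta_fit = fit(y_shifted)
shifted_pred = J_test @ theta_fit - J_test @ theta0

# Posterior mean of a zero-mean GP with kernel K = J J^T and noise sigma2.
K = J_train @ J_train.T
K_star = J_test @ J_train.T
zero_mean_gp = K_star @ np.linalg.solve(K + sigma2 * np.eye(n_train), y)

print("max abs difference:", np.max(np.abs(shifted_pred - zero_mean_gp)))
```

In this linear setting the two predictions agree up to floating-point error, because the contribution of the random initialization cancels exactly after the shift.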

Abstract

Recent work has shown that training wide neural networks with gradient descent is formally equivalent to computing the mean of the posterior distribution in a Gaussian Process (GP) with the Neural Tangent Kernel (NTK) as the prior covariance and zero aleatoric noise (Jacot et al., 2018). In this paper, we extend this framework in two ways. First, we show how to deal with non-zero aleatoric noise. Second, we derive an estimator for the posterior covariance, giving us a handle on epistemic uncertainty. Our proposed approach integrates seamlessly with standard training pipelines, as it involves training a small number of additional predictors using gradient descent on a mean squared error loss. We provide a proof of concept for our method through empirical evaluation on synthetic regression.

Paper Structure (23 sections, 7 theorems, 36 equations, 1 figure, 2 algorithms)

Key Result

Lemma 3.1

Consider a parametric model $f(x; \theta)$, where $x \in \mathcal{X} \subset \mathbb{R}^N$ and $\theta \in \mathbb{R}^p$, initialized under some assumptions with parameters $\theta_0$. Minimizing the regularized mean squared error loss with respect to $\theta$ to find the optimal set of parameters is equivalent to computing the posterior mean of a Gaussian process with non-zero aleatoric noise.
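
For a model that is exactly linear in its parameters, this kind of equivalence can be checked numerically. The following is a minimal sketch under stated assumptions, not the lemma's precise setting: the feature matrices `Phi_train` and `Phi_test` play the role of the network Jacobian, the regularizer weight is set equal to an assumed noise variance `sigma2`, and the loss is left unnormalized (the paper's scaling constants may differ). Gradient descent from $\theta_0$ on the regularized squared error is compared with the posterior mean of a GP whose prior mean is the initial prediction and whose kernel is `Phi Phi^T`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear model f(x; theta) = Phi(x) @ theta (stand-in for a linearized network).
n_train, n_test, p = 8, 5, 50
Phi_train = rng.normal(size=(n_train, p)) / np.sqrt(p)
Phi_test = rng.normal(size=(n_test, p)) / np.sqrt(p)
theta0 = rng.normal(size=p)     # initial parameters
y = rng.normal(size=n_train)    # noisy training targets
sigma2 = 0.1                    # assumed aleatoric noise variance

K = Phi_train @ Phi_train.T     # kernel on training inputs
K_star = Phi_test @ Phi_train.T # cross-kernel between test and training inputs

# Gradient descent on L(theta) = ||y - Phi theta||^2 + sigma2 * ||theta - theta0||^2.
# The step size is the inverse of the largest Hessian eigenvalue, 2 * (lambda_max + sigma2).
lr = 1.0 / (2.0 * (np.linalg.eigvalsh(K).max() + sigma2))
theta = theta0.copy()
for _ in range(3000):
    grad = 2.0 * Phi_train.T @ (Phi_train @ theta - y) + 2.0 * sigma2 * (theta - theta0)
    theta -= lr * grad
pred_gd = Phi_test @ theta

# GP posterior mean with prior mean f(.; theta0), kernel K, and observation noise sigma2.
f0_train = Phi_train @ theta0
f0_test = Phi_test @ theta0
pred_gp = f0_test + K_star @ np.linalg.solve(K + sigma2 * np.eye(n_train), y - f0_train)

print("max abs difference:", np.max(np.abs(pred_gd - pred_gp)))
```

The regularizer pulls the parameters back toward $\theta_0$, which is what makes the noise term `sigma2 * I` appear next to the kernel matrix in the GP formula.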

Figures (1)

  • Figure 1: The NTK-GP posterior and its approximations: (top-left) Analytic Posterior, (top-right) Analytic upper bound on posterior (all eigenvectors), (bottom-left) Analytic upper bound on posterior (5 eigenvectors), (bottom-right) Posterior obtained with gradient descent ($K=5$ predictors, $K' = 0$).

Theorems & Definitions (12)

  • Lemma 3.1
  • Lemma 3.2
  • Proposition 3.1
  • proof
  • Lemma B.1
  • proof
  • Lemma B.1
  • proof
  • Lemma D.1
  • proof
  • ...and 2 more