Noise Contrastive Priors for Functional Uncertainty

Danijar Hafner, Dustin Tran, Timothy Lillicrap, Alex Irpan, James Davidson

TL;DR

This work tackles unreliable uncertainty estimates in neural networks by introducing Noise Contrastive Priors (NCPs), which impose data-space priors to encourage high uncertainty for inputs outside the training distribution. NCPs combine an input perturbation strategy with a wide output prior and can be incorporated into variational frameworks by penalizing deviations in output space, yielding a scalable, function-level prior. Empirically, NCPs improve uncertainty estimates and active-learning performance on both toy and large-scale flight-delay regression tasks, with BBB+NCP often delivering the strongest improvements and stable generalization to unseen data. The approach provides a practical, scalable alternative to weight-space priors and highlights a fruitful direction toward explicit, data-driven priors for robust extrapolation.

Abstract

Obtaining reliable uncertainty estimates of neural network predictions is a long-standing challenge. Bayesian neural networks have been proposed as a solution, but it remains open how to specify their prior. In particular, the common practice of an independent normal prior in weight space imposes relatively weak constraints on the function posterior, allowing it to generalize in unforeseen ways on inputs outside of the training distribution. We propose noise contrastive priors (NCPs) to obtain reliable uncertainty estimates. The key idea is to train the model to output high uncertainty for data points outside of the training distribution. NCPs do so using an input prior, which adds noise to the inputs of the current mini-batch, and an output prior, which is a wide distribution given these inputs. NCPs are compatible with any model that can output uncertainty estimates, are easy to scale, and yield reliable uncertainty estimates throughout training. Empirically, we show that NCPs prevent overfitting outside of the training distribution and result in uncertainty estimates that are useful for active learning. We demonstrate the scalability of our method on the flight delays data set, where we significantly improve upon previously published results.
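
The input prior and output prior described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a model that returns a Gaussian mean and standard deviation, Gaussian input noise, a wide normal output prior centered on the training targets, and a KL penalty in output space. The names `ncp_loss`, `gaussian_kl`, `predict`, `input_noise_std`, `prior_std`, and `weight` are hypothetical, as are their default values.

```python
import numpy as np


def gaussian_kl(mu_p, sigma_p, mu_q, sigma_q):
    """Elementwise KL[N(mu_p, sigma_p^2) || N(mu_q, sigma_q^2)]."""
    return (np.log(sigma_q / sigma_p)
            + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * sigma_q ** 2)
            - 0.5)


def ncp_loss(predict, x, y, rng,
             input_noise_std=0.5, prior_std=3.0, weight=0.1):
    """Hypothetical NCP-style objective for a Gaussian regression model.

    predict: callable mapping inputs to (mean, std) arrays (assumed API).
    """
    # Data term: Gaussian negative log-likelihood on the clean mini-batch.
    mu, sigma = predict(x)
    nll = 0.5 * np.log(2.0 * np.pi * sigma ** 2) + (y - mu) ** 2 / (2.0 * sigma ** 2)

    # Input prior: perturb the mini-batch inputs with noise (assumed Gaussian).
    x_noisy = x + input_noise_std * rng.standard_normal(x.shape)

    # Output prior: a wide normal; centering it on the targets is an assumption.
    mu_noisy, sigma_noisy = predict(x_noisy)
    kl = gaussian_kl(y, prior_std, mu_noisy, sigma_noisy)

    # Penalize predictions on noisy inputs that deviate from the wide prior.
    return nll.mean() + weight * kl.mean()
```

Under these assumptions, a model that stays confident on the perturbed inputs pays a large KL penalty, while a model whose predictive distribution widens toward the output prior pays little.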

Paper Structure

This paper contains 23 sections, 10 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: Predictive distributions on a low-dimensional active learning task. The predictive distributions are visualized as mean and two standard deviations shaded. They decompose into epistemic uncertainty and aleatoric noise. Data points are only available within two bands, and are selected using the expected information gain. (a) A deterministic network models no uncertainty but only noise, resulting in overconfidence outside of the data distribution. (b) A variational Bayesian neural network with independent normal prior represents uncertainty and noise separately but is overconfident outside of the training distribution. (c) On the OOD classifier model, NCP prevents overconfidence. (d) On the Bayesian neural network, NCP produces smooth uncertainty estimates that generalize well to unseen data points. Models trained with NCP also separate uncertainty and noise well. The experimental setup is described in \ref{sec:toy-active}.
  • Figure 2: Graphical representations of the two uncertainty-aware models we consider. Circles denote random variables, squares denote deterministic variables, and shading denotes observations during training. (a) The Bayesian neural network captures a belief over parameters for the predictive mean, while the predictive variance is a deterministic function of the input. In practice, we only use weight uncertainty for the mean's output layer and share earlier layers between the mean and variance. (b) The out-of-distribution classifier model uses a binary auxiliary variable $o$ to determine if a given input is out-of-distribution; given its value, the output is a mixture of a neural network prediction and a wide output prior.
  • Figure 3: Active learning on the 1-dimensional regression problem, mean and standard deviation over 20 seeds. The test root mean squared error (RMSE) and negative log predictive density (NLPD) of the models trained with NCP decrease during the active learning run, while the baseline models select less informative data and overfit. The deterministic network is barely visible in the plots as it overfits quickly. \ref{fig:visualizations} shows the predictive distributions of the models.
  • Figure 4: Active learning on the flights data set. The models trained with NCP achieve significantly lower negative log predictive density (NLPD) on the test set, and Bayes by Backprop with NCP achieves the lowest root mean squared error (RMSE). The test NLPD for the baseline models diverges as they overfit to the visible data points. Plots show mean and standard deviation over 10 runs.
  • Figure 5: Robustness to different noise patterns. Plots show the final test performance on the flights active learning task (mean and standard deviation over 5 seeds). Lower is better. NCP is robust to the choice of input noise and improves over the baselines in all settings (compare \ref{fig:flights-active}).
  • ...and 1 more figure
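
The split of the predictive distribution into epistemic uncertainty and aleatoric noise mentioned in the Figure 1 caption can be illustrated with the law of total variance. The sketch below is one common way to compute such a split and is not necessarily how the authors produced the figure. It assumes the model can be sampled several times (for example, by sampling weights), with each sample giving a predictive mean and a noise standard deviation per input. The function name `decompose_uncertainty` and its argument layout are hypothetical.

```python
import numpy as np


def decompose_uncertainty(means, noise_stds):
    """Split predictive variance into epistemic and aleatoric parts.

    means, noise_stds: arrays of shape (num_model_samples, num_inputs),
    one row per sampled model (assumed layout).
    """
    means = np.asarray(means, dtype=float)
    noise_stds = np.asarray(noise_stds, dtype=float)

    # Epistemic part: disagreement between sampled models about the mean.
    epistemic_var = means.var(axis=0)

    # Aleatoric part: average noise variance predicted by the sampled models.
    aleatoric_var = (noise_stds ** 2).mean(axis=0)

    # Law of total variance: the two parts sum to the predictive variance.
    return epistemic_var, aleatoric_var, epistemic_var + aleatoric_var
```

For a deterministic network there is only one model sample, so the epistemic term is zero and only the noise term remains, which matches the behavior the caption describes for panel (a).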