Vecchia Gaussian Process Ensembles on Internal Representations of Deep Neural Networks
Felix Jimenez, Matthias Katzfuss
TL;DR
The paper introduces the deep Vecchia ensemble (DVE), a deterministic uncertainty quantification (UQ) framework that leverages multiple intermediate representations of pretrained neural networks. By building an ensemble of Vecchia Gaussian processes on layer-specific embeddings and fusing their predictions via a product of experts, DVE achieves scalable UQ without retraining and can distinguish aleatoric from epistemic uncertainty. The method achieves competitive RMSE and improved NLL on UCI regression tasks and chemical-property prediction, while providing interpretable conditioning sets that reveal which training points influence a test prediction. DVE addresses feature collapse and enables uncertainty estimates for pretrained models, with potential applications in latent-space optimization and robust decision-making. Limitations include reliance on access to training data and Gaussian-likelihood assumptions, suggesting future work on non-Gaussian likelihoods and integration with Bayesian weight models.
Abstract
For regression tasks, standard Gaussian processes (GPs) provide natural uncertainty quantification (UQ), while deep neural networks (DNNs) excel at representation learning. Deterministic UQ methods for neural networks have successfully combined the two and require only a single pass through the neural network. However, current methods necessitate changes to network training to address feature collapse, where unique inputs map to identical feature vectors. We propose an alternative solution, the deep Vecchia ensemble (DVE), which allows deterministic UQ to work in the presence of feature collapse, negating the need for network retraining. DVE comprises an ensemble of GPs built on hidden-layer outputs of a DNN, achieving scalability via Vecchia approximations that leverage nearest-neighbor conditional independence. DVE is compatible with pretrained networks and incurs low computational overhead. We demonstrate DVE's utility on several datasets and carry out experiments to understand the inner workings of the proposed method.
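To make the idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes each hidden layer yields an embedding matrix for the training inputs, predicts at a test embedding with a GP conditioned only on its m nearest training embeddings (a simplified stand-in for a full Vecchia approximation, which also involves an ordering of the data), and fuses the per-layer Gaussian predictions with a plain, unweighted product of experts. The squared-exponential kernel, the hyperparameter values, and the function names `rbf_kernel`, `nn_gp_predict`, and `product_of_experts` are all illustrative assumptions.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0, variance=1.0):
    # Squared-exponential kernel between rows of a and rows of b (assumed kernel).
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return variance * np.exp(-0.5 * np.maximum(sq, 0.0) / lengthscale**2)

def nn_gp_predict(z_train, y_train, z_test, m=10, noise=1e-2):
    """GP prediction at one test embedding, conditioning only on its
    m nearest training embeddings (the conditioning set)."""
    dists = np.linalg.norm(z_train - z_test, axis=1)
    idx = np.argsort(dists)[:m]              # indices of the conditioning set
    zc, yc = z_train[idx], y_train[idx]
    k_cc = rbf_kernel(zc, zc) + noise * np.eye(len(idx))
    k_tc = rbf_kernel(zc, z_test[None, :])[:, 0]
    mean = k_tc @ np.linalg.solve(k_cc, yc)
    # Predictive variance of a noisy observation: prior + noise - reduction.
    var = 1.0 + noise - k_tc @ np.linalg.solve(k_cc, k_tc)
    return mean, var, idx

def product_of_experts(means, variances):
    # Precision-weighted fusion of independent Gaussian predictions.
    precisions = 1.0 / np.asarray(variances)
    var = 1.0 / precisions.sum()
    mean = var * (precisions * np.asarray(means)).sum()
    return mean, var
```

In use, one would call `nn_gp_predict` once per layer with that layer's embeddings and pass the resulting means and variances to `product_of_experts`. The returned `idx` arrays are the per-layer conditioning sets, which indicate the training points that influence a given test prediction.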
