Gaussian Process Tilted Nonparametric Density Estimation using Fisher Divergence Score Matching
John Paisley, Wei Zhang, Brian Barr
TL;DR
The paper tackles nonparametric density estimation by constructing a GP-tilted density $q(x) \propto \exp\{f(x)\}\mathcal{N}(x|\mu,\Sigma)$ with $f\sim\mathcal{GP}(0,k)$ and learning it tractably via Fisher divergence (FD) score matching. Approximating the GP with random Fourier features reduces the GP part of the score to a single-layer, cosine-activated network that is linear in a learnable vector $\theta$, which enables three closed-form FD-based learning algorithms: basic FD, noise-conditional FD, and a Fisher variational predictive distribution (FVPD). Learning is efficient and can be done in a single pass over the data, and the approach applies broadly to moderate-dimensional problems; FVPD additionally provides a Bayesian-like predictive distribution by integrating over $\theta$. Empirical results on ten UCI datasets and additional benchmarks show competitive density estimation accuracy and substantial computational advantages, especially for FVPD, compared to MAP and KDE baselines.
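To make the "single-layer network" reading concrete, the score of the tilted density can be written out under one common random Fourier feature parameterization. The feature map $\phi$, the frequency matrix $W$ (with $S$ rows), and the phase vector $b$ below are assumptions for illustration; the summary above does not fix them.

```latex
\begin{aligned}
\phi(x) &= \sqrt{2/S}\,\cos(Wx + b), \qquad f(x) \approx \theta^\top \phi(x), \\
\nabla_x \log q(x) &= -\sqrt{2/S}\; W^\top \big(\theta \odot \sin(Wx + b)\big) \;-\; \Sigma^{-1}(x - \mu).
\end{aligned}
```

The normalizing constant of $q$ does not appear because it does not depend on $x$. Since $\sin(z) = \cos(z - \pi/2)$, the first term is still a cosine-activated layer with shifted phases, and it is linear in $\theta$.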
Abstract
We propose a nonparametric density estimator based on the Gaussian process (GP) and derive three novel closed-form learning algorithms based on Fisher divergence (FD) score matching. The density estimator is formed by multiplying a base multivariate normal distribution with an exponentiated GP refinement, and so we refer to it as a GP-tilted nonparametric density. By representing the GP part of the score as a linear function using the random Fourier feature (RFF) approximation, we show that optimization can be solved in closed form for the three FD-based objectives considered. These include the basic and noise-conditional versions of the Fisher divergence, as well as an alternative to noise-conditional FD models based on variational inference (VI) that we propose in this paper. For this novel learning approach, we propose an ELBO-like optimization to approximate the posterior distribution, with which we then derive a Fisher variational predictive distribution. The RFF representation of the GP, which is functionally equivalent to a single-layer neural network score model with cosine activation, provides a useful linear representation of the GP for which all expectations can be solved. The Gaussian base distribution also helps with tractability of the VI approximation and ensures that our proposed density is well-defined. We demonstrate our three learning algorithms, as well as a MAP baseline algorithm, on several low-dimensional density estimation problems. The closed-form nature of the learning problem removes the reliance on iterative learning algorithms, making this technique particularly well-suited to big data sets, since only sufficient statistics collected from a single pass through the data are needed.
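As a rough illustration of why a linear-in-$\theta$ score gives a closed-form, single-pass fit, the sketch below applies the standard score matching objective $\mathbb{E}[\tfrac{1}{2}\|s(x)\|^2 + \mathrm{tr}(\nabla_x s(x))]$ to the score $s(x) = J_\phi(x)^\top\theta - \Sigma^{-1}(x-\mu)$. It is not the paper's algorithm: the feature map $\phi(x)=\sqrt{2/S}\cos(Wx+b)$, the Gaussian frequencies `W`, the ridge weight `lam` (standing in for a prior on $\theta$), and the function names `fd_stats` and `solve_theta` are all assumptions made for this example.

```python
import numpy as np

def fd_stats(X, W, b, mu, Sigma_inv):
    """Sufficient statistics of the quadratic score matching objective for one chunk.

    Assumed feature map: phi(x) = sqrt(2/S) * cos(W x + b).
    Returns (sum_n J_n J_n^T, sum_n [J_n g_n + laplacian_n], chunk size),
    where J_n is the S x d Jacobian of phi at x_n and g_n is the base score.
    """
    S = W.shape[0]
    scale = np.sqrt(2.0 / S)
    Z = X @ W.T + b                              # (N, S) pre-activations
    sinZ, cosZ = np.sin(Z), np.cos(Z)
    G = -(X - mu) @ Sigma_inv                    # (N, d) score of the Gaussian base
    # J_n = -scale * diag(sin z_n) W, so J_n J_n^T = scale^2 * (W W^T) * outer(sin, sin)
    A_sum = scale**2 * (W @ W.T) * (sinZ.T @ sinZ)
    Jg = -scale * sinZ * (G @ W.T)               # (N, S) rows are J_n g_n
    lap = -scale * cosZ * np.sum(W**2, axis=1)   # (N, S) Laplacian of each feature
    c_sum = (Jg + lap).sum(axis=0)
    return A_sum, c_sum, X.shape[0]

def solve_theta(A_sum, c_sum, n, lam=1.0):
    """Minimize 0.5 * th^T (A + lam I) th + th^T c; lam is an assumed ridge term."""
    S = c_sum.shape[0]
    return -np.linalg.solve(A_sum / n + lam * np.eye(S), c_sum / n)

# Toy data: a two-component mixture in 2-D (illustrative only).
rng = np.random.default_rng(0)
d, S, N = 2, 50, 1000
X = rng.normal(size=(N, d)) + np.where(rng.random((N, 1)) < 0.5, -1.5, 1.5)
W = rng.normal(size=(S, d))                      # assumed: RBF-kernel frequencies
b = rng.uniform(0.0, 2.0 * np.pi, size=S)
mu = X.mean(axis=0)
Sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))

# One pass over the data in chunks: the statistics simply add up.
stats = [fd_stats(chunk, W, b, mu, Sigma_inv) for chunk in np.array_split(X, 4)]
A_sum = sum(s[0] for s in stats)
c_sum = sum(s[1] for s in stats)
n = sum(s[2] for s in stats)
theta = solve_theta(A_sum, c_sum, n)
```

Because `A_sum` and `c_sum` are plain sums over data points, they can be accumulated chunk by chunk and the linear solve performed once at the end, which is the sense in which no iterative optimizer is required.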
