Gaussian Process Tilted Nonparametric Density Estimation using Fisher Divergence Score Matching

John Paisley, Wei Zhang, Brian Barr

TL;DR

The paper tackles nonparametric density estimation by constructing a GP-tilted density $q(x) \propto \exp\{f(x)\}\mathcal{N}(x|\mu,\Sigma)$ with $f\sim\mathcal{GP}(0,k)$ and learning it tractably via Fisher divergence (FD) score matching. When the GP is approximated with random Fourier features, the score of the model becomes a single-layer cosine-activated network with a learnable vector $\theta$, which enables three closed-form FD-based learning algorithms: basic FD, noise-conditional FD, and a Fisher variational predictive distribution (FVPD). Learning is efficient and needs only a single pass over the data, and the approach applies to moderate-dimensional problems; FVPD additionally provides a Bayesian-like predictive distribution by integrating over $\theta$. Empirical results on ten UCI datasets and additional benchmarks show competitive density estimation accuracy and substantial computational advantages, especially for FVPD, compared to MAP and KDE baselines.

Abstract

We propose a nonparametric density estimator based on the Gaussian process (GP) and derive three novel closed-form learning algorithms based on Fisher divergence (FD) score matching. The density estimator is formed by multiplying a base multivariate normal distribution with an exponentiated GP refinement, and so we refer to it as a GP-tilted nonparametric density. By representing the GP part of the score as a linear function using the random Fourier feature (RFF) approximation, we show that optimization can be solved in closed form for the three FD-based objectives considered. This includes the basic and noise-conditional versions of the Fisher divergence, as well as an alternative to noise-conditional FD models based on variational inference (VI) that we propose in this paper. For this novel learning approach, we propose an ELBO-like optimization to approximate the posterior distribution, with which we then derive a Fisher variational predictive distribution. The RFF representation of the GP, which is functionally equivalent to a single-layer neural network score model with cosine activation, provides a useful linear representation of the GP for which all expectations can be solved. The Gaussian base distribution also helps with tractability of the VI approximation and ensures that our proposed density is well-defined. We demonstrate our three learning algorithms, as well as a MAP baseline algorithm, on several low-dimensional density estimation problems. The closed-form nature of the learning problem removes the reliance on iterative learning algorithms, making this technique particularly well-suited to big data sets, since only sufficient statistics collected from a single pass through the data are needed.
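
To make the closed-form claim concrete, here is a minimal NumPy sketch of basic FD score matching for a GP-tilted density under the RFF approximation. It is an illustration derived from the standard (Hyvärinen) score matching objective, not the paper's Algorithm 2. The RBF kernel with width `gamma`, the ridge penalty `lam`, the choice of base mean and covariance, and all function names are assumptions made for this example.

```python
import numpy as np

def rff(X, W, b):
    """Return phi(x) = sqrt(2/S) cos(Wx + b) and the matching sine term."""
    Z = X @ W.T + b
    scale = np.sqrt(2.0 / W.shape[0])
    return scale * np.cos(Z), scale * np.sin(Z)

def fd_sufficient_stats(X, W, b, mu, Sigma):
    """Data averages A = E[J J^T] and c = E[J Sigma^{-1}(x - mu) - Laplacian(phi)]."""
    N = X.shape[0]
    cosZ, sinZ = rff(X, W, b)
    # J(x) = d phi / d x = -diag(sinZ) W, hence J J^T = (sinZ sinZ^T) * (W W^T).
    A = (sinZ.T @ sinZ / N) * (W @ W.T)
    R = np.linalg.solve(Sigma, (X - mu).T).T              # rows: Sigma^{-1}(x_n - mu)
    base_term = -(sinZ * (R @ W.T)).mean(axis=0)          # E[J Sigma^{-1}(x - mu)]
    laplacian = -(cosZ * (W ** 2).sum(axis=1)).mean(axis=0)  # E[Laplacian of phi]
    return A, base_term - laplacian

def fit_theta(A, c, lam):
    """Minimizer of 0.5 th^T A th - th^T c + 0.5 lam ||th||^2 (assumed ridge penalty)."""
    return np.linalg.solve(A + lam * np.eye(len(c)), c)

def log_q_unnorm(x, theta, W, b, mu, Sigma):
    """Unnormalized log density: theta^T phi(x) + log N(x | mu, Sigma) up to a constant."""
    cosZ, _ = rff(x[None, :], W, b)
    diff = x - mu
    return cosZ[0] @ theta - 0.5 * diff @ np.linalg.solve(Sigma, diff)

def score(x, theta, W, b, mu, Sigma):
    """grad_x log q(x) = J(x)^T theta - Sigma^{-1}(x - mu)."""
    _, sinZ = rff(x[None, :], W, b)
    return -(W.T @ (sinZ[0] * theta)) - np.linalg.solve(Sigma, x - mu)

# Toy two-cluster data; all settings below are illustrative choices.
rng = np.random.default_rng(0)
d, S, N, gamma, lam = 2, 50, 500, 1.0, 10.0
X = rng.normal(size=(N, d)) + np.array([1.0, -1.0]) * rng.choice([-1, 1], size=(N, 1))
mu, Sigma = X.mean(axis=0), 4.0 * np.cov(X.T)   # broad Gaussian base (assumed choice)
W = rng.normal(scale=1.0 / gamma, size=(S, d))  # RBF spectral samples
b = rng.uniform(0.0, 2.0 * np.pi, size=S)

A, c = fd_sufficient_stats(X, W, b, mu, Sigma)
theta = fit_theta(A, c, lam)
```

Because `A` and `c` are plain averages over the data, they can be accumulated batch by batch, which is what makes a single pass through a large data set sufficient in this sketch.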

Paper Structure

This paper contains 17 sections, 40 equations, 5 figures, 3 tables, 4 algorithms.

Figures (5)

  • Figure 1: A simplified flowchart of the GP-tilted density. The input to the exponential function is the Gaussian process approximated by random Fourier features, which turns the GP into a single-layer neural network. The only learnable parameter is the vector $\theta \in \mathbb{R}^S$. (We set $S=1000$ in our experiments.) A broad, predefined Gaussian base distribution determines the region of interest for the density to ensure integrability. In Algorithms 2-4, we present novel closed-form solutions for learning $\theta$ and constructing predictive distributions based on the Fisher divergence score matching objective function.
  • Figure 2: Samples (orange) shown over data (blue) generated for (a-d) Algorithms 1-4, and (e,f) kernel density estimation (see text for discussion). We set $S=1000$ and $\lambda = 10$ for (a-d), and $S=100K$ for (e). All methods use the same kernel width $\gamma$.
  • Figure 3: Algorithm 3 example. Noise-conditional density contours on CA House data for decreasing $\sigma$. For better visualization, Figure \ref{fig:CA_samps} shows samples from the final RHS density contour (NCFD), along with samples from the other learned methods.
  • Figure 4: Example CDFs of projections of MAGIC data used to calculate KS and WD. Data (blue) vs FVPD (red). Each plot corresponds to a different random $\boldsymbol{\mathrm{v}}$.
  • Figure 5: Contour plots of GP-tilted density learned by each algorithm on three data sets.
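
The projection-based evaluation described in the Figure 4 caption can be sketched as follows. This assumes that KS and WD denote the Kolmogorov-Smirnov statistic and the 1-Wasserstein distance between the empirical CDFs of data and model samples after projecting both onto a random direction $\boldsymbol{\mathrm{v}}$; the function name and the toy data are hypothetical, and the paper's exact protocol may differ.

```python
import numpy as np

def projected_ks_wd(X, Y, v):
    """KS statistic and 1-Wasserstein distance between X and Y projected onto v."""
    v = v / np.linalg.norm(v)                 # use a unit-length direction
    a = np.sort(X @ v)
    b = np.sort(Y @ v)
    grid = np.sort(np.concatenate([a, b]))    # all jump points of either CDF
    Fa = np.searchsorted(a, grid, side="right") / len(a)   # empirical CDF of a
    Fb = np.searchsorted(b, grid, side="right") / len(b)   # empirical CDF of b
    gap = np.abs(Fa - Fb)
    ks = gap.max()                            # sup |Fa - Fb|
    wd = np.sum(gap[:-1] * np.diff(grid))     # integral of |Fa - Fb|
    return ks, wd

rng = np.random.default_rng(0)
data = rng.normal(size=(2000, 5))
samples = rng.normal(size=(2000, 5))          # stand-in for samples from a learned model
v = rng.normal(size=5)                        # one random projection direction
ks, wd = projected_ks_wd(data, samples, v)
```

Averaging both quantities over several random directions gives a scalar summary; the number of directions to use is left open here.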

Theorems & Definitions (2)

  • Definition 2.1: Gaussian Process
  • Definition 2.2: Random Fourier Features
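
As a quick illustration of the random Fourier feature idea named in Definition 2.2, the sketch below checks numerically that inner products of cosine features approximate a shift-invariant kernel. It assumes an RBF kernel $k(x,y)=\exp\{-\|x-y\|^2/(2\gamma^2)\}$ with frequencies drawn from $\mathcal{N}(0, I/\gamma^2)$ and uniform phases; the paper's own parameterization may differ, and the helper names are illustrative.

```python
import numpy as np

def rff_features(X, W, b):
    """phi(x) = sqrt(2/S) cos(Wx + b), one row of features per input point."""
    return np.sqrt(2.0 / W.shape[0]) * np.cos(X @ W.T + b)

def rbf_kernel(X, Y, gamma):
    """Exact RBF kernel matrix with width gamma."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * gamma ** 2))

rng = np.random.default_rng(0)
d, S, gamma = 3, 20000, 1.5
W = rng.normal(scale=1.0 / gamma, size=(S, d))   # spectral samples of the RBF kernel
b = rng.uniform(0.0, 2.0 * np.pi, size=S)        # random phases

X = rng.normal(size=(5, d))
Phi = rff_features(X, W, b)
K_approx = Phi @ Phi.T                            # phi(x)^T phi(y)
K_exact = rbf_kernel(X, X, gamma)
max_err = np.abs(K_approx - K_exact).max()
```

With these features, a function $f(x) = \theta^\top \phi(x)$ with $\theta \sim \mathcal{N}(0, I)$ has covariance $\phi(x)^\top\phi(y)$, so it behaves approximately like a draw from $\mathcal{GP}(0,k)$ while being linear in $\theta$.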