Deterministic Fokker-Planck Transport -- With Applications to Sampling, Variational Inference, Kernel Mean Embeddings & Sequential Monte Carlo

Ilja Klebanov

TL;DR

The probability flow ODE derived from the Fokker-Planck equation requires the current probability density, which is usually intractable. By closely examining the drawbacks of approximating this density via kernel density estimation, the paper uncovers opportunities to turn these limitations into advantages in variational inference, kernel mean embeddings, and sequential Monte Carlo.

Abstract

The Fokker-Planck equation can be reformulated as a continuity equation, which naturally suggests using the associated velocity field in particle flow methods. While the resulting probability flow ODE offers appealing properties - such as defining a gradient flow of the Kullback-Leibler divergence between the current and target densities with respect to the 2-Wasserstein distance - it relies on evaluating the current probability density, which is intractable in most practical applications. By closely examining the drawbacks of approximating this density via kernel density estimation, we uncover opportunities to turn these limitations into advantages in contexts such as variational inference, kernel mean embeddings, and sequential Monte Carlo.
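
For orientation, the reformulation mentioned above can be written out as follows. This is a sketch under an assumed normalization (unit diffusion coefficient, drift $\nabla \log \rho_{\textup{tar}}$); the scaling and notation in the paper itself may differ. The only identity used is $\Delta \rho_t = \nabla \cdot (\rho_t \nabla \log \rho_t)$.

```latex
\begin{align*}
\partial_t \rho_t
  &= -\nabla \cdot \bigl( \rho_t \, \nabla \log \rho_{\textup{tar}} \bigr) + \Delta \rho_t \\
  &= -\nabla \cdot \Bigl( \rho_t \, \bigl( \nabla \log \rho_{\textup{tar}} - \nabla \log \rho_t \bigr) \Bigr) \\
  &= -\nabla \cdot \bigl( \rho_t \, v_t^{\textup{FP}} \bigr),
  \qquad
  v_t^{\textup{FP}} = \nabla \log \frac{\rho_{\textup{tar}}}{\rho_t}.
\end{align*}
```

Particles that follow $\dot{x}_t = v_t^{\textup{FP}}(x_t)$ therefore have the same marginal densities as the corresponding SDE, but evaluating $v_t^{\textup{FP}}$ requires $\nabla \log \rho_t$, which is the intractable quantity the abstract refers to.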

Paper Structure

This paper contains 19 sections, 46 equations, 6 figures, 1 table, 1 algorithm.

Figures (6)

  • Figure 4.1: $\rho_{\textup{ref}}$-distributed samples (standard Gaussian) are transported to $\rho_{\textup{tar}}$-distributed samples (a mixture of three Gaussian densities) by the SDE \ref{equ:SDE} using the Euler--Maruyama discretization (left) and by the ODE \ref{equ:ODE} with $v_t = v_t^{{\textup{FP}}}$ (right). The densities of both methods coincide analytically for each $t \ge 0$. Ten trajectories are shown in black. The density estimate required for evaluating $v_t^{{\textup{FP}}}$ was obtained by kernel density estimation based on the current samples at each time step. The implications of this approximation are discussed in detail in \ref{section:Approach}. Notably, as observed in the right plot, the final points are too "concentrated" to represent the target density $\rho_{\textup{tar}}$ and exhibit more regularity than independent samples, particularly near the outer regions.
  • Figure 6.1: Left: The target density $\rho_{\textup{tar}}$ with $J = 23$ KDE points. Due to the asymmetric initial density $\rho_{\textup{ref}}$, significantly more KDE points are concentrated near the right mode than the left (see \ref{section:Annealing} for potential corrections). Middle: The estimated density $\hat{\rho}_{t}^{h}$ with $K = 526$ KDE-QMC points. The asymmetry is corrected through larger importance weights (indicated by marker sizes) near the left mode. Right: Error estimates for $\mathbb{E}_{\rho_{\textup{tar}}}[f]$ with $f(x) = x$ using MCMC, independent and stratified samples from $\hat{\rho}_{t}^{h}$, and KDE-QMC points. As expected, the first three methods exhibit a convergence rate of $K^{-1/2}$ (the error plots display the mean over ten independent runs), whereas the KDE-QMC points achieve a convergence rate of approximately $K^{-1}$.
  • Figure 6.2: Comparison of 78 KDE points with the first 78 samples from kernel herding (left) and sequential Bayesian quadrature (right; marker sizes correspond to sample weights). While all three point sets are fairly evenly distributed, herding and SBQ samples are more frequently placed in regions of lower density. This behavior is due to the sequential nature of their generation, as illustrated by the green circles: Once a "central" point is fixed at the optimal position for a given time step, it cannot be adjusted later, forcing subsequent samples into positions further away. In contrast, KDE points benefit from greater flexibility since they are generated simultaneously. In addition, KDE points do not require solving non-convex optimization problems, as discussed in \ref{remark:Oprimization_issues_SBQ}.
  • Figure 6.3: Performance comparison of KDE points, kernel herding, and SBQ using both uniform and SBQ weights, measured by the average quadrature error over 50 randomly selected functions (left) and by maximum mean discrepancy (MMD) (right).
  • Figure 6.4: Illustration of two steps in SMC for a toy example. The resampling step is carried out by first embedding the weighted point set, followed by outbedding to obtain an unweighted set of KDE points. With only 40 KDE points, the kernel mean embedding is approximated with notable accuracy.
  • ...and 1 more figure
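
The setting of Figure 4.1 (right) can be imitated in one dimension with a few lines of NumPy. The following is a minimal sketch, not the paper's implementation: the mixture parameters `MEANS`, `STDS`, `WEIGHTS`, the bandwidth `h`, the step size `dt`, the explicit Euler scheme, and the function names are all assumptions made for illustration, and the velocity is taken to be $\nabla \log \rho_{\textup{tar}} - \nabla \log \hat{\rho}_t^h$ with a Gaussian KDE $\hat{\rho}_t^h$ built from the current particles.

```python
import numpy as np

# Hypothetical 1D target: a mixture of three Gaussians (all values are assumptions).
MEANS = np.array([-3.0, 0.0, 3.0])
STDS = np.array([0.8, 0.8, 0.8])
WEIGHTS = np.array([1.0, 1.0, 1.0]) / 3.0


def score_target(x):
    """Gradient of log rho_tar at the points x (shape (n,))."""
    z = (x[:, None] - MEANS) / STDS                  # shape (n, 3)
    logits = np.log(WEIGHTS) - np.log(STDS) - 0.5 * z**2
    logits -= logits.max(axis=1, keepdims=True)      # stabilise the softmax
    resp = np.exp(logits)
    resp /= resp.sum(axis=1, keepdims=True)          # component responsibilities
    return (resp * (-z / STDS)).sum(axis=1)          # mix the component scores


def score_kde(x, h):
    """Gradient of the log of a Gaussian KDE built on x, evaluated at x itself."""
    diff = x[:, None] - x[None, :]                   # diff[i, j] = x_i - x_j
    k = np.exp(-0.5 * (diff / h) ** 2)               # unnormalised kernel values
    return (k * (-diff / h**2)).sum(axis=1) / k.sum(axis=1)


def fp_flow(x0, h=0.3, dt=0.02, n_steps=1500):
    """Explicit Euler steps of dx/dt = score_target(x) - score_kde(x)."""
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x = x + dt * (score_target(x) - score_kde(x, h))
    return x


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x0 = rng.standard_normal(200)                    # reference samples
    x = fp_flow(x0)
    print("initial variance:", x0.var(), "final variance:", x.var())
```

Because the KDE score is recomputed from the moving particles at every step, the particles interact with one another and end up more regularly spaced than independent samples would be, which is the effect the caption of Figure 4.1 points out.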