Forward $\chi^2$ Divergence Based Variational Importance Sampling

Chengrui Li; Yule Wang; Weihan Li; Anqi Wu

TL;DR

The paper introduces variational importance sampling (VIS), a method that directly optimizes the marginal log-likelihood $\ln p(\bm x;\theta)$ for latent-variable models by using an optimal proposal distribution that minimizes the forward $\chi^2$ divergence. VIS yields a tighter log-likelihood estimator than the ELBO as the Monte Carlo budget grows, and provides a numerically stable gradient framework in log-space to update the proposal. The authors demonstrate VIS across toy mixtures, variational auto-encoders, and partially observable GLMs, including synthetic and real neural datasets, showing consistent improvements in LL, CLL, and HLL as well as superior parameter recovery. The work offers a practical, statistically principled alternative to VI and related IS-based methods, with potential broad impact for learning in complex latent-variable models where posterior ambiguity is a challenge.
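The quantities named above can be sketched as follows. The notation is assumed here for illustration rather than quoted from the paper: the importance weight $w$ and the sample index $k$ are introduced for this sketch, and the forward direction is taken to mean $\chi^2(p\,\|\,q)$ with the posterior as the first argument.

$$
\ln \hat{p}(\bm x;\theta,\phi) = \ln \frac{1}{K}\sum_{k=1}^{K} w(\bm z^{(k)}), \qquad w(\bm z) = \frac{p(\bm x,\bm z;\theta)}{q(\bm z|\bm x;\phi)}, \qquad \bm z^{(k)} \sim q(\bm z|\bm x;\phi)
$$

$$
\mathbb{E}_q\left[\ln \hat{p}(\bm x;\theta,\phi)\right] \le \ln \mathbb{E}_q\left[\hat{p}(\bm x;\theta,\phi)\right] = \ln p(\bm x;\theta)
$$

$$
\chi^2\left(p(\bm z|\bm x;\theta)\,\|\,q(\bm z|\bm x;\phi)\right) = \mathbb{E}_q\left[\left(\frac{p(\bm z|\bm x;\theta)}{q(\bm z|\bm x;\phi)}\right)^2\right] - 1 = \frac{\mathrm{Var}_q\left[w(\bm z)\right]}{p(\bm x;\theta)^2}
$$

The second line is Jensen's inequality and explains why the estimator is biased downward. The third line uses $\mathbb{E}_q[w(\bm z)] = p(\bm x;\theta)$ and shows that a smaller forward $\chi^2$ divergence means a smaller variance of the importance weights, which is the sense in which such a proposal is well suited to IS.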

Abstract

Maximizing the log-likelihood is a crucial aspect of learning latent variable models, and variational inference (VI) stands as the commonly adopted method. However, VI can encounter challenges in achieving a high log-likelihood when dealing with complicated posterior distributions. In response to this limitation, we introduce a novel variational importance sampling (VIS) approach that directly estimates and maximizes the log-likelihood. VIS leverages the optimal proposal distribution, achieved by minimizing the forward $\chi^2$ divergence, to enhance log-likelihood estimation. We apply VIS to various popular latent variable models, including mixture models, variational auto-encoders, and partially observable generalized linear models. Results demonstrate that our approach consistently outperforms state-of-the-art baselines, both in terms of log-likelihood and model parameter estimation.
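To make the contrast between the IS estimate and the ELBO estimate concrete, here is a minimal sketch on a hypothetical one-dimensional Gaussian model (prior $z \sim \mathcal{N}(0,1)$, likelihood $x|z \sim \mathcal{N}(z,1)$), chosen only because its marginal and posterior are known in closed form. The model, the function names, and the default settings are assumptions made for illustration and are not taken from the paper's code.

```python
import numpy as np


def log_normal(v, mean, std):
    """Log density of a univariate Gaussian N(mean, std^2) evaluated at v."""
    return -0.5 * np.log(2.0 * np.pi) - np.log(std) - 0.5 * ((v - mean) / std) ** 2


def log_mean_exp(a):
    """Numerically stable ln(mean(exp(a))) over the last axis."""
    m = np.max(a, axis=-1, keepdims=True)
    return np.squeeze(m, axis=-1) + np.log(np.mean(np.exp(a - m), axis=-1))


def estimate(x, q_mean, q_std, K=5, repeats=500, seed=0):
    """Monte Carlo estimates under the Gaussian proposal q(z|x) = N(q_mean, q_std^2).

    Hypothetical toy model: z ~ N(0, 1), x | z ~ N(z, 1), so p(x) = N(0, 2).
    """
    rng = np.random.default_rng(seed)
    z = rng.normal(q_mean, q_std, size=(repeats, K))
    # Log importance weights: ln p(x, z) - ln q(z | x), all kept in log-space.
    log_w = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0) - log_normal(z, q_mean, q_std)
    return {
        "true_ll": log_normal(x, 0.0, np.sqrt(2.0)),     # exact ln p(x)
        "is_ll": log_mean_exp(log_w),                    # IS estimate of ln p(x)
        "elbo": np.mean(log_w, axis=-1),                 # ELBO estimate
        "log_second_moment": log_mean_exp(2.0 * log_w),  # estimate of ln E_q[w^2]
    }


if __name__ == "__main__":
    x = 1.0
    prior_proposal = estimate(x, q_mean=0.0, q_std=1.0)
    exact_posterior = estimate(x, q_mean=x / 2.0, q_std=np.sqrt(0.5))
    for name, r in [("prior as proposal", prior_proposal), ("exact posterior", exact_posterior)]:
        print(name, r["true_ll"], r["is_ll"].mean(), r["elbo"].mean(),
              r["log_second_moment"].mean())
```

For any fixed set of samples, the IS estimate is at least as large as the ELBO estimate, because the log of an average is never smaller than the average of the logs. When the proposal equals the exact posterior, every importance weight equals $p(x)$, so both estimates coincide with the true log-likelihood and the second moment of the weights reaches its minimum, $p(x)^2$. The second moment is the part of the forward $\chi^2$ divergence that depends on the proposal, which is why it is reported here in log-space.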

Paper Structure (37 sections, 25 equations, 9 figures, 1 table, 1 algorithm)

Introduction
Background of variational inference
Bias of the ELBO estimator.
Variational importance sampling
Down-biased IS estimator of the marginal log-likelihood.
Bias of the IS estimator.
Gradient estimator.
Experiments
Baselines for comparison.
Metrics.
A toy mixture model
Model.
Experimental setup.
Results.
Variational auto-encoder
...and 22 more sections

Figures (9)

Figure 1: (a): The bias between the marginal log-likelihood $\ln p(\bm x;\theta)$ and the expectation of its IS estimator $\mathbb{E}_q[\ln \hat{p}(\bm x;\theta,\phi)]$, the $\mathrm{ELBO}(\bm x;\theta,\phi)$, and the expectation of the ELBO's estimator $\mathbb{E}_q[\widehat{\mathrm{ELBO}}(\bm x;\theta,\phi)]$. When estimating $\ln p(\bm x;\theta)$, the down-biased IS estimator $\mathbb{E}_q[\ln \hat{p}(\bm x;\theta,\phi)]$ is a tighter lower bound than the down-biased ELBO estimator $\mathbb{E}_q[\widehat{\mathrm{ELBO}}(\bm x;\theta,\phi)]$. (b): Empirical visualization of the four quantities in (a) with different numbers of Monte Carlo samples $K \in \left\{1,2,3,4,5\right\}$. Each box in (b) is based on 500 repeats and the hollow circle on the box is their average. An asymptotic difference occurs when increasing $K$. (c): Different $q(\bm z|\bm x;\phi)$ obtained by minimizing the forward $\chi^2$ divergence, which is optimal for IS, versus by minimizing the reverse KL divergence.
Figure 2: (a): LL, CLL, and HLL evaluated on the test dataset. (b): Convergence curves of the parameter set $\theta$ learned by different methods. The dashed curves are the true parameters used for generating the data, and the solid curves are the learned parameters. (c): The posterior distribution given $x=0$ and $x=1$ learned by different methods. The dashed curves are the true posterior $p(\bm z|x;\theta^{\text{true}})$, the solid curves are the learned posterior $p(\bm z|x;\theta)$, and the dotted curves are the approximate posterior $q(z|x;\phi)$ given by the learned variational/proposal distribution.
Figure 3: (a): The marginal log-likelihood on the test set after each training epoch. (b): Examples of raw images and the reconstructed images by different methods.
Figure 4: (a): Graphical model of $p(\bm X,\bm Z;\theta)$ and $q(\bm Z|\bm X;\phi)$. (b): The LL, CLL, HLL on the test set, and the average parameter error of the weights and biases in the linear mapping. (c): True parameters and parameters estimated by different methods in the first trial. For each matrix, the leftmost column is the bias $\bm b$, and the remaining block is the weight $\bm W$. The top-left block of the weight part represents visible-to-visible, the top-right block represents hidden-to-visible, the bottom-left block represents visible-to-hidden, and the bottom-right block represents hidden-to-hidden. (d): Predictive firing rates on a spike train from different methods. Specifically, given a complete test spike train $\bm Y = [\bm X, \bm Z]$, we can predict the firing rates by the complete model $p(\bm X,\bm Z;\theta)$ via Eq. \ref{eq:glm} for both observed neurons (e.g., neuron 1) and hidden neurons (e.g., neuron 4). For hidden neurons (e.g., neuron 4), we can also predict the firing rates by $q(\bm Z|\bm X;\phi)$.
Figure 5: (a): The marginal log-likelihood on the test segment with different numbers of hidden neurons. (b): The estimated weight matrices from different methods. (c): 20 predictive firing rates generated from 20 hidden spikes sampled from different variational/proposal distributions.
...and 4 more figures
