Differentiable Annealed Importance Sampling Minimizes The Symmetrized Kullback-Leibler Divergence Between Initial and Target Distribution

Johannes Zenn; Robert Bamler

TL;DR

This work analyzes differentiable annealed importance sampling (DAIS) and proves that, in the limit of many annealing steps, DAIS minimizes the symmetrized KL divergence between the learnable initial distribution $q_0$ and the target distribution $f/Z$, which gives DAIS a variational interpretation. By introducing DAIS$_0$, the authors treat $q_0$ as an explicit, compact approximate posterior that can be used directly for inference, avoiding the computational cost of running full AIS at test time. Empirically, DAIS$_0$ often yields more accurate uncertainty estimates than reverse-KL VI, IWVI, and MSC, especially in higher-dimensional settings, while keeping a tractable and interpretable representation. The findings connect AIS, VI, and MCMC-based methods, show practical benefits for Gaussian process regression and Bayesian logistic regression, and illustrate the trade-off between mode-seeking and mass-covering behavior in variational approximations.

Abstract

Differentiable annealed importance sampling (DAIS), proposed by Geffner & Domke (2021) and Zhang et al. (2021), allows optimizing over the initial distribution of AIS. In this paper, we show that, in the limit of many transitions, DAIS minimizes the symmetrized Kullback-Leibler divergence between the initial and target distribution. Thus, DAIS can be seen as a form of variational inference (VI) as its initial distribution is a parametric fit to an intractable target distribution. We empirically evaluate the usefulness of the initial distribution as a variational distribution on synthetic and real-world data, observing that it often provides more accurate uncertainty estimates than VI (optimizing the reverse KL divergence), importance weighted VI, and Markovian score climbing (optimizing the forward KL divergence).
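The difference between the reverse, forward, and symmetrized KL divergence mentioned in the abstract can be illustrated on a toy problem. The following is a minimal sketch and not the paper's method: it assumes a one-dimensional bimodal target (an equal-weight mixture of $\mathcal{N}(-3,1)$ and $\mathcal{N}(3,1)$), evaluates the divergences by trapezoidal integration on a grid, and takes the symmetrized KL divergence to be the plain sum of both directions (the paper may use a different normalization). All function names (`log_normal`, `log_target`, `kl_on_grid`, `divergences`) are hypothetical.

```python
import math


def log_normal(z, mean, std):
    # Log-density of a univariate Gaussian N(mean, std^2).
    return -0.5 * ((z - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def log_target(z):
    # Assumed toy target: equal-weight mixture of N(-3, 1) and N(3, 1),
    # evaluated with log-sum-exp for numerical stability.
    a = log_normal(z, -3.0, 1.0)
    b = log_normal(z, 3.0, 1.0)
    m = max(a, b)
    return m + math.log(0.5 * math.exp(a - m) + 0.5 * math.exp(b - m))


def kl_on_grid(log_a, log_b, lo=-20.0, hi=20.0, n=8001):
    """Approximate KL(a || b) = int a(z) (log a(z) - log b(z)) dz (trapezoidal rule)."""
    h = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        z = lo + i * h
        la = log_a(z)
        term = math.exp(la) * (la - log_b(z))
        total += term * (0.5 if i in (0, n - 1) else 1.0)
    return total * h


def divergences(mean, std):
    """Return (reverse, forward, symmetrized) KL for q = N(mean, std^2) and the toy target."""
    def log_q(z):
        return log_normal(z, mean, std)

    reverse = kl_on_grid(log_q, log_target)   # KL(q || p), the objective of standard VI
    forward = kl_on_grid(log_target, log_q)   # KL(p || q)
    return reverse, forward, reverse + forward


if __name__ == "__main__":
    print("single-mode fit  :", divergences(3.0, 1.0))
    print("moment-matched fit:", divergences(0.0, math.sqrt(10.0)))
```

In this toy setting, the single-mode Gaussian has the smaller reverse KL divergence, whereas the broad moment-matched Gaussian has the smaller forward and symmetrized KL divergence. This matches the intuition that an objective containing the forward term penalizes ignoring a mode.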

Paper Structure (39 sections, 2 theorems, 41 equations, 7 figures, 23 tables)

Introduction
Related Work
MCMC-Augmented Variational Distributions.
VI With Forward and Reverse KL Divergence.
Differentiable Annealed Importance Sampling (DAIS).
Estimating Normalization Constants
Importance Sampling
Variational Inference
Importance Weighted Variational Inference
(Differentiable) Annealed Importance Sampling
Differentiable Annealed Importance Sampling.
Analyzing the Initial Distribution of DAIS
DAIS Minimizes the Symmetrized KL Divergence
Statement for $N>1$.
Reverse KL Divergence.
...and 24 more sections

Key Result

Theorem 3.1

For large $N$, the gap $\Delta_\textup{IWVI}^N := \log Z - \textup{ELBO}_\textup{IWVI}^N$ of importance weighted VI is proportional to the variance of $w_\textup{IWVI}^N$, defined in Eq. \ref{eq:weight-IWVI}. Formally, if $\limsup_{N \to \infty}\mathbb{E}_q[1 \,/\, w_\textup{IWVI}^N] < \infty$ and there exis
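The gap $\log Z - \textup{ELBO}_\textup{IWVI}^N$ from the theorem excerpt above can be probed numerically. The following is a minimal sketch under stated assumptions, none of which come from the paper: a one-dimensional unnormalized target $f(z) = \exp(-z^2/2)$, so that $\log Z = \tfrac{1}{2}\log 2\pi$ is known; a Gaussian proposal $q = \mathcal{N}(0, 1.5^2)$; and the standard importance weighted bound $\mathbb{E}_q[\log \tfrac{1}{N}\sum_{i=1}^N f(z_i)/q(z_i)]$. The names `iwvi_bound` and `iwvi_gap` are hypothetical.

```python
import math
import random

# Assumed toy target f(z) = exp(-z^2 / 2); its normalizer is Z = sqrt(2 * pi).
LOG_Z = 0.5 * math.log(2.0 * math.pi)


def log_f(z):
    # Log of the unnormalized target density.
    return -0.5 * z * z


def log_q(z, scale):
    # Log-density of the proposal N(0, scale^2).
    return -0.5 * (z / scale) ** 2 - math.log(scale) - 0.5 * math.log(2.0 * math.pi)


def iwvi_bound(n_particles, scale=1.5, n_reps=20000, seed=0):
    """Monte Carlo estimate of E_q[ log (1/N) sum_i f(z_i) / q(z_i) ]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_reps):
        log_w = []
        for _ in range(n_particles):
            z = rng.gauss(0.0, scale)
            log_w.append(log_f(z) - log_q(z, scale))
        m = max(log_w)  # log-sum-exp trick for numerical stability
        total += m + math.log(sum(math.exp(lw - m) for lw in log_w) / n_particles)
    return total / n_reps


def iwvi_gap(n_particles, **kwargs):
    """Estimated gap log Z - ELBO_IWVI^N for the toy problem."""
    return LOG_Z - iwvi_bound(n_particles, **kwargs)


if __name__ == "__main__":
    for n in (1, 10):
        print(n, iwvi_gap(n))
```

For $N=1$ the bound reduces to the standard ELBO, so the gap equals the reverse KL divergence, which for this toy problem is $(1.5^2 - 1)/2 - \log 1.5 \approx 0.22$. With more particles the estimated gap shrinks, in line with the qualitative statement of the theorem.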

Figures (7)

Figure 1: The landscape spanned by various lower bounds to the normalization constant $Z={\int\!f({\bm{z}})\,\mathrm{d}{\bm{z}}}$, where $N$ denotes the number of particles and $K$ the number of importance sampling transitions. A discussion and further details can be found in \ref{sec:background}.
Figure 2: Mean absolute error of estimated mean and standard deviation as a function of the number of DAIS samples used for the estimate. The estimator converges poorly (see \ref{sec:method:compactness}).
Figure 3: Density of variational distributions of VI, IWVI, MSC, and $\text{DAIS}_0$ ($K$) evaluated on samples from a $d$-dimensional bimodal Gaussian target distribution. "-": unable to find an optimum, "c": mass-covering distribution, "s": mode-seeking distribution, "u": undecidable whether "c" or "s". $\text{DAIS}_0$ achieves higher densities in higher dimension $d$ for increasing $K$ across all considered $N$. MSC does not converge for $N=1$. $\text{MSC}_{1\text{c}}$ learns variational distributions that are less mass-covering for larger $d$ than $\text{DAIS}_0\ (16)$. $\text{MSC}_{8\text{c}}$ sometimes achieves higher densities in higher dimension $d$ but performs inconsistently across $N$. Results are discussed in \ref{sec:exp:d-dim-blobs}.
Figure 4: Densities on the diagonal between the two modes of the bimodal Gaussian distribution (same experiment as \ref{fig:n-dim-gaussian-blobs} with $N=1, d=3$). VI covers a single mode while $\text{DAIS}_0$ covers both modes with increasing $K$ (details in \ref{sec:exp:d-dim-blobs}).
Figure 5: Gaussian process regression on generated data, using a prior with RBF kernel with two different sets of parameters (see $\star$ in \ref{tab:gp-inference}). We show $97.5\%$ quantiles of the posterior covariance for analytic (shaded gray; covariance matrix is diagonalized on data points, see \ref{sup:sec:gp-inference}), IWVI (red), $\text{DAIS}_0$ (blue), and $\text{MSC}_{8\text{c}}$ (yellow). Learned means are indistinguishable from the analytic mean (black) at this line width. $\text{DAIS}_0$ often provides more accurate uncertainty estimates than the other methods (details in \ref{sec:exp:gp-inference}).
...and 2 more figures

Theorems & Definitions (5)

Theorem 3.1: Theorem 3 in Domke & Sheldon (2018)
proof
Theorem 4.1
proof
proof
