On the Asymptotics of Importance Weighted Variational Inference
Badr-Eddine Cherief-Abdellatif, Randal Douc, Arnaud Doucet, Hugo Marival
TL;DR
This work provides the first rigorous asymptotic theory for Importance Weighted Variational Inference (IWVI). It establishes consistency for both the model parameter estimate $\tilde{\theta}_n^k$ and the variational parameter estimate $\tilde{\phi}_n^k$, the latter converging to a variance-minimizing target, under weak moment conditions. It further proves asymptotic normality and efficiency of $\tilde{\theta}_n^k$ when the Monte Carlo sample size $k$ grows fast enough relative to the data size $n$. The required growth rate of $k$ exhibits a phase transition between $\sqrt{n}$ and $n$, depending on the smoothness of the importance weights, which is analyzed through the reparameterization framework. Simulations complement the theory by illustrating how IWVI can closely approximate the maximum likelihood estimator (MLE) and outperform MSLE variants under certain sampling regimes. Overall, the paper provides foundational guarantees for IWVI in large-sample and large-$k$ settings.
Abstract
For complex latent variable models, the likelihood function is not available in closed form. In this context, a popular method to perform parameter estimation is Importance Weighted Variational Inference. It essentially maximizes the expectation of the logarithm of an importance sampling estimate of the likelihood with respect to both the latent variable model parameters and the importance distribution parameters, the expectation being itself with respect to the importance samples. Despite its great empirical success in machine learning, a theoretical analysis of the limit properties of the resulting estimates is still lacking. We fill this gap by establishing consistency when both the Monte Carlo and the observed data sample sizes go to infinity simultaneously. We also establish asymptotic normality and efficiency under additional conditions on the relative growth rates of the Monte Carlo and observed data sample sizes. We distinguish several regimes related to the smoothness of the importance ratio.
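To make the objective described in the abstract concrete, the following is a minimal sketch on a toy model that is an assumption of this example and is not taken from the paper: $z \sim N(\theta, 1)$ and $x \mid z \sim N(z, 1)$, so that the marginal likelihood $N(x; \theta, 2)$ is known exactly. The importance distribution is assumed to be $q_\phi(z \mid x) = N(ax + b, s^2)$ with $\phi = (a, b, s)$. All function and variable names are hypothetical. For each observation, the code draws $k$ importance samples, averages the importance weights, takes the logarithm, and averages over the data. It returns a single Monte Carlo draw of the objective, not its exact expectation over the importance samples.

```python
import numpy as np


def log_normal_pdf(x, mean, std):
    """Log density of N(mean, std^2) evaluated at x (broadcasts over arrays)."""
    return -0.5 * np.log(2.0 * np.pi) - np.log(std) - 0.5 * ((x - mean) / std) ** 2


def iw_objective(x, theta, a, b, s, k, rng):
    """One Monte Carlo draw of the importance weighted objective, averaged over data.

    Toy model (assumed for illustration): z ~ N(theta, 1), x | z ~ N(z, 1).
    Importance distribution (assumed): q(z | x) = N(a * x + b, s^2).
    """
    n = x.shape[0]
    q_mean = a * x + b                                        # shape (n,)
    z = q_mean[:, None] + s * rng.standard_normal((n, k))     # shape (n, k)
    # Log importance weights: log p(z) + log p(x | z) - log q(z | x).
    log_w = (
        log_normal_pdf(z, theta, 1.0)
        + log_normal_pdf(x[:, None], z, 1.0)
        - log_normal_pdf(z, q_mean[:, None], s)
    )
    # Numerically stable log of the average weight for each observation.
    m = log_w.max(axis=1, keepdims=True)
    log_lik_hat = m[:, 0] + np.log(np.mean(np.exp(log_w - m), axis=1))
    return log_lik_hat.mean()


def exact_log_lik(x, theta):
    """Exact average log-likelihood of the toy model: x ~ N(theta, 2)."""
    return log_normal_pdf(x, theta, np.sqrt(2.0)).mean()


rng = np.random.default_rng(0)
theta_true = 1.0
x = theta_true + np.sqrt(2.0) * rng.standard_normal(200)

# Compare a deliberately poor proposal for several Monte Carlo sample sizes k.
for k in (1, 10, 100, 1000):
    value = iw_objective(x, theta_true, a=0.0, b=0.0, s=2.0, k=k, rng=rng)
    print(f"k={k:5d}  objective={value:.4f}  exact={exact_log_lik(x, theta_true):.4f}")
```

In this toy model the exact posterior is $N((\theta + x)/2, 1/2)$. Choosing $a = 1/2$, $b = \theta/2$, $s = \sqrt{1/2}$ makes every importance weight equal to the marginal likelihood, so the objective matches the exact log-likelihood for any $k$. With a mismatched proposal, the objective lies below the exact log-likelihood in expectation, by Jensen's inequality, and the gap is expected to shrink as $k$ grows.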
