Bounding Evidence and Estimating Log-Likelihood in VAE
Łukasz Struski, Marcin Mazur, Paweł Batorski, Przemysław Spurek, Jacek Tabor
TL;DR
The paper addresses the variational gap between log-evidence and ELBO in VAE-like models by deriving general upper bounds on concave transforms $f(\mathbb{E} X)$ and tightening them via importance sampling (IS). The core contributions are a foundational bound $f(\mathbb{E} X) \le \mathbb{E}[f(X)+(Y-X)f'(X)]$, where $Y$ is an independent copy of $X$; an additive IS-based tightening strategy with an optimal choice of $C$; and improved bounds built on a $g,h$-inequality framework. All of these are specialized to $f=\log$ in order to estimate the log-evidence. The authors provide theoretical guarantees, convergence properties, and practical procedures, then validate the approach through synthetic case studies and extensive VAE/IWAE experiments on MNIST, SVHN, and CelebA, where the bounds compare favorably with the CUBO, EUBO, and TVO bounds. The method yields tighter bounds and useful model-evaluation tools, but it relies on estimators and focuses on VAEs, which restricts its immediate applicability at training time. Overall, the work advances principled quantification of the variational gap and offers a practical toolkit for comparing generative models trained with lower bounds.
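The foundational bound can be checked numerically. The sketch below is not the authors' code: it takes $f=\log$, so that $f'(x)=1/x$, and assumes $Y$ is drawn independently from the same distribution as $X$. The lognormal test distribution, the sample size, and the helper name `log_upper_bound` are illustrative choices, picked because $\log \mathbb{E} X$ is known in closed form for a lognormal variable.

```python
import numpy as np

def log_upper_bound(x, y):
    """Monte Carlo estimate of E[log X + (Y - X) / X] from paired samples.

    Hypothetical helper: x and y are assumed to be independent draws
    from the same positive distribution.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.mean(np.log(x) + (y - x) / x))

rng = np.random.default_rng(0)
n = 200_000
sigma = 0.5

# Illustrative assumption: X is lognormal with log-mean 0 and log-std sigma.
x = rng.lognormal(mean=0.0, sigma=sigma, size=n)
y = rng.lognormal(mean=0.0, sigma=sigma, size=n)

log_mean = sigma**2 / 2            # exact log E[X] for this lognormal
lower = float(np.mean(np.log(x)))  # Jensen lower bound E[log X]
upper = log_upper_bound(x, y)      # upper bound with f = log

print(f"lower={lower:.4f}  log E[X]={log_mean:.4f}  upper={upper:.4f}")
```

For this distribution the upper bound has the closed form $e^{\sigma^2}-1$, so the estimate should sandwich $\log \mathbb{E} X = \sigma^2/2$ between the Jensen lower bound and a value near $e^{0.25}-1$.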
Abstract
Many crucial problems in deep learning and statistical inference are caused by the variational gap, i.e., the difference between the model evidence (log-likelihood) and the evidence lower bound (ELBO). In particular, in the classical VAE setting, where training uses an ELBO cost function, it is difficult to compare the effects of training across models robustly, since we know only a lower bound on the log-likelihood of the data and not the log-likelihood itself. To address this problem, we introduce a general and effective upper bound that allows us to approximate the evidence of the data efficiently. We provide extensive theoretical and experimental studies of our approach, including a comparison with other state-of-the-art upper bounds and an application as a tool for evaluating models trained with various lower bounds.
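To make the variational gap concrete, the following sketch brackets the log-evidence of a toy latent-variable model whose evidence is known exactly. Everything in it is an assumption for illustration rather than the paper's experimental setup: the model is $z \sim \mathcal{N}(0,1)$, $x \mid z \sim \mathcal{N}(z,1)$, so that $p(x)=\mathcal{N}(x;0,2)$; the approximate posterior `q` is a deliberately too-narrow Gaussian; and the upper bound is the basic $f=\log$ bound applied to importance weights $w = p(x,z)/q(z \mid x)$ with two independent draws.

```python
import numpy as np

def log_normal_pdf(v, mean, std):
    """Log density of N(mean, std^2) evaluated at v."""
    return -0.5 * np.log(2.0 * np.pi * std**2) - 0.5 * ((v - mean) / std) ** 2

def log_weights(x, z, q_mean, q_std):
    """log w(z) = log p(x, z) - log q(z | x) for the toy model above."""
    log_joint = log_normal_pdf(z, 0.0, 1.0) + log_normal_pdf(x, z, 1.0)
    return log_joint - log_normal_pdf(z, q_mean, q_std)

rng = np.random.default_rng(0)
x_obs = 1.0
# The exact posterior is N(x/2, 1/2); q keeps the mean but uses std 0.6.
q_mean, q_std = x_obs / 2.0, 0.6
n = 200_000

# Two independent batches of latent samples from q.
z1 = rng.normal(q_mean, q_std, size=n)
z2 = rng.normal(q_mean, q_std, size=n)
lw1 = log_weights(x_obs, z1, q_mean, q_std)
lw2 = log_weights(x_obs, z2, q_mean, q_std)

elbo = float(np.mean(lw1))                              # E[log w]
upper = float(np.mean(lw1 + np.exp(lw2 - lw1) - 1.0))   # E[log w + w'/w - 1]
log_evidence = float(log_normal_pdf(x_obs, 0.0, np.sqrt(2.0)))

print(f"ELBO={elbo:.4f}  log p(x)={log_evidence:.4f}  upper={upper:.4f}")
```

In this setting the ELBO and the upper bound should bracket the exact log-evidence, and the width of the bracket gives a computable ceiling on the variational gap even when $\log p(x)$ itself is unavailable.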
