Bures-Wasserstein Importance-Weighted Evidence Lower Bound: Exposition and Applications
Peiwen Jiang, Takuo Matsubara, Minh-Ngoc Tran
TL;DR
The paper addresses the instability of IW-ELBO optimization in Euclidean space by recasting variational inference for Gaussian families in Bures-Wasserstein (BW) geometry. It derives the Wasserstein gradient of the IW-ELBO and its projection onto BW space, and proves that the SNR of this gradient scales as $\Omega(\sqrt{K})$ in the number of importance samples $K$, so optimization remains stable as $K$ grows. The resulting BW-IW-ELBO algorithm applies forward-Euler updates to the mean and covariance; its mass-covering properties are demonstrated on challenging multimodal targets and a Bayesian logistic regression task, where it outperforms the baselines. The analysis also extends to the VR-IWAE bound, which has an analogous gradient structure and comparable stability guarantees. Overall, the BW framework yields more robust tail coverage and faster convergence for Gaussian VI in complex posterior landscapes, with potential extensions to broader bounds and geometries.
Abstract
The Importance-Weighted Evidence Lower Bound (IW-ELBO) has emerged as an effective objective for variational inference (VI), tightening the standard ELBO and mitigating its mode-seeking behaviour. However, optimising the IW-ELBO in Euclidean space is often inefficient, as its gradient estimators suffer from a vanishing signal-to-noise ratio (SNR). This paper formulates the optimisation of the IW-ELBO in Bures-Wasserstein space, a manifold of Gaussian distributions equipped with the 2-Wasserstein metric. We derive the Wasserstein gradient of the IW-ELBO and project it onto the Bures-Wasserstein space to yield a tractable algorithm for Gaussian VI. A pivotal contribution of our analysis concerns the stability of the gradient estimator. While the SNR of the standard Euclidean gradient estimator is known to vanish as the number of importance samples $K$ increases, we prove that the SNR of the Wasserstein gradient scales favourably as $\Omega(\sqrt{K})$, ensuring optimisation efficiency even for large $K$. We further extend this geometric analysis to the Variational Rényi Importance-Weighted Autoencoder (VR-IWAE) bound, establishing analogous stability guarantees. Experiments demonstrate that the proposed framework achieves superior approximation performance compared to the baselines.
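For readers unfamiliar with the objective, the following minimal sketch illustrates the quantity being optimised. It assumes the standard definition of the $K$-sample IW-ELBO, $\mathbb{E}_{z_1,\dots,z_K \sim q}\big[\log \tfrac{1}{K}\sum_{k=1}^{K} p(x, z_k)/q(z_k)\big]$, and a univariate Gaussian $q = N(\text{mean}, \text{var})$ for simplicity. The function names `log_normal_pdf` and `iw_elbo_estimate` are hypothetical and chosen for illustration; this is not the paper's BW-IW-ELBO algorithm, only a Monte Carlo estimate of the bound itself.

```python
import math
import random


def log_normal_pdf(z, mean, var):
    """Log-density of the univariate Gaussian N(mean, var) at z."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (z - mean) ** 2 / var)


def iw_elbo_estimate(log_target, mean, var, K, rng):
    """One Monte Carlo estimate of the K-sample IW-ELBO for q = N(mean, var).

    log_target(z) is assumed to return the log of the (possibly
    unnormalised) target density, i.e. log p(x, z).
    """
    # Draw K samples from q and record the log importance weights
    # log w_k = log p(x, z_k) - log q(z_k).
    log_w = []
    for _ in range(K):
        z = rng.gauss(mean, math.sqrt(var))
        log_w.append(log_target(z) - log_normal_pdf(z, mean, var))
    # log((1/K) * sum_k w_k), computed stably with the log-sum-exp trick.
    m = max(log_w)
    return m + math.log(sum(math.exp(lw - m) for lw in log_w) / K)


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical unnormalised target: 3 * N(0.5, 2.0), so log Z = log 3.
    log_target = lambda z: math.log(3.0) + log_normal_pdf(z, 0.5, 2.0)
    # With q equal to the normalised target, every weight equals Z.
    print(iw_elbo_estimate(log_target, 0.5, 2.0, K=16, rng=rng))
```

With $K = 1$ the estimate reduces to the usual single-sample ELBO estimate. When $q$ matches the normalised target exactly, every importance weight equals the normalising constant, so the estimate equals its logarithm for any $K$.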
