Improving Variational Autoencoder Estimation from Incomplete Data with Mixture Variational Families
Vaidotas Simkus, Michael U. Gutmann
TL;DR
This work analyzes VAE training on incomplete data and shows that missingness increases the complexity of the model's posterior over the latent variables, which can bias estimation when the variational family is not flexible enough. It introduces two families of variational mixtures: finite mixtures (MissVAE/MissSVAE/MissIWAE/MissSIWAE) and a decomposed, imputation-based approach (DeMissVAE) that separates data imputation from model learning via an imputation distribution f^t(x_mis|x_obs). The paper derives objective bounds for both approaches, including CVI-based and marginalised bounds, and gives practical guidance for optimizing mixture components, including implicit reparameterisation and stratified sampling. Experiments on synthetic MoG data, UCI datasets, and MNIST/Omniglot show that variational mixtures can improve VAE estimation under missing data, with gains depending on the dataset and budget, and that the decomposed method can yield latent spaces structured similarly to those learned from fully observed data.
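To make the idea of a finite-mixture variational family with stratified sampling concrete, the following is a minimal sketch, not the paper's implementation. It assumes a hypothetical toy model (z ~ N(0, 1), each x_d | z ~ N(z, 1)) in place of a neural decoder, and all function names are illustrative. Missing dimensions are simply left out of the likelihood, and the ELBO is estimated by drawing the same number of reparameterised samples from every mixture component and weighting each stratum by its mixture weight.

```python
import math
import random


def log_normal(x, mean, std):
    """Log density of a univariate Gaussian."""
    return -0.5 * math.log(2 * math.pi) - math.log(std) - 0.5 * ((x - mean) / std) ** 2


def log_joint(x, mask, z):
    """log p(x_obs, z) for the assumed toy model: z ~ N(0,1), x_d | z ~ N(z,1).

    Dimensions with mask False are treated as missing and skipped, which
    marginalises them out because the likelihood factorises over dimensions.
    """
    lp = log_normal(z, 0.0, 1.0)
    for x_d, observed in zip(x, mask):
        if observed:
            lp += log_normal(x_d, z, 1.0)
    return lp


def log_mixture(z, weights, means, stds):
    """log q(z) for a Gaussian mixture, computed with log-sum-exp for stability."""
    terms = [math.log(w) + log_normal(z, m, s) for w, m, s in zip(weights, means, stds)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def stratified_elbo(x, mask, weights, means, stds, samples_per_component=64, rng=None):
    """Stratified Monte Carlo estimate of E_q[log p(x_obs, z) - log q(z)].

    Each component k contributes its own batch of reparameterised samples
    z = mu_k + sigma_k * eps, and the per-component averages are combined
    using the mixture weights.
    """
    rng = rng or random.Random(0)
    total = 0.0
    for w, m, s in zip(weights, means, stds):
        acc = 0.0
        for _ in range(samples_per_component):
            z = m + s * rng.gauss(0.0, 1.0)
            acc += log_joint(x, mask, z) - log_mixture(z, weights, means, stds)
        total += w * acc / samples_per_component
    return total


# Usage: the second dimension is missing, so only x[0] enters the likelihood.
x = [1.2, None]
mask = [True, False]
estimate = stratified_elbo(x, mask, weights=[0.5, 0.5], means=[0.0, 1.0], stds=[1.0, 0.8])
```

In this conjugate toy model the exact posterior given x[0] is N(x[0]/2, 1/2), so a mixture whose components all equal that Gaussian recovers log p(x_obs) exactly, which gives a convenient sanity check. With a learned decoder the same stratified structure would apply, but the weights, means, and scales would be produced by an encoder.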
Abstract
We consider the task of estimating variational autoencoders (VAEs) when the training data is incomplete. We show that missing data increases the complexity of the model's posterior distribution over the latent variables compared to the fully-observed case. The increased complexity may adversely affect the fit of the model due to a mismatch between the variational and model posterior distributions. We introduce two strategies based on (i) finite variational-mixture and (ii) imputation-based variational-mixture distributions to address the increased posterior complexity. Through a comprehensive evaluation of the proposed approaches, we show that variational mixtures are effective at improving the accuracy of VAE estimation from incomplete data.
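One way to see why missingness makes the latent posterior more complex is the standard marginalisation identity below. It is offered as an illustration using the x_obs / x_mis notation from the summary above, not as a quotation of the paper's derivation.

```latex
p(z \mid x_{\text{obs}})
  = \int p(z \mid x_{\text{obs}}, x_{\text{mis}})\,
         p(x_{\text{mis}} \mid x_{\text{obs}})\, \mathrm{d}x_{\text{mis}}
```

The posterior given only the observed values is a continuous mixture of complete-data posteriors, weighted by how plausible each imputation is. Even if every complete-data posterior were close to Gaussian, the mixture need not be, which is consistent with the motivation for mixture variational families.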
