Multivariate Variational Autoencoder
Mehmet Can Yavuz
TL;DR
The paper addresses the limitation of diagonal posteriors in VAEs by introducing MVAE, a VAE with a tractable full-covariance posterior built from a global coupling matrix and per-sample diagonal scales. This design allows correlated latent factors while keeping a closed-form KL and a standard reparameterization. The authors propose a multi-criterion evaluation framework covering reconstruction, linear-probe discrimination, calibration, and clustering, and report that MVAE matches or improves on diagonal-covariance VAEs across Larochelle-style MNIST variants, Fashion-MNIST, and CIFAR-10/100, especially at low to moderate latent dimensionality. Qualitative analyses show smoother latent traversals and sharper reconstructions, and the code and dataset splits are released to enable reproducible comparisons. Overall, MVAE offers a principled, scalable path toward more geometrically grounded and reliable latent representations in VAEs, with potential extensions to hierarchical and multimodal settings.
Abstract
Learning latent representations that are simultaneously expressive, geometrically well-structured, and reliably calibrated remains a central challenge for Variational Autoencoders (VAEs). Standard VAEs typically assume a diagonal Gaussian posterior, which simplifies optimization but rules out correlated uncertainty and often yields entangled or redundant latent dimensions. We introduce the Multivariate Variational Autoencoder (MVAE), a tractable full-covariance extension of the VAE that augments the encoder with sample-specific diagonal scales and a global coupling matrix. This induces a multivariate Gaussian posterior of the form $\mathcal{N}(\mu_\phi(x), C \operatorname{diag}(\sigma_\phi^2(x)) C^\top)$, enabling correlated latent factors while preserving a closed-form KL divergence and a simple reparameterization path. Beyond likelihood, we propose a multi-criterion evaluation protocol that jointly assesses reconstruction quality (MSE, ELBO), downstream discrimination (linear probes), probabilistic calibration (NLL, Brier, ECE), and unsupervised structure (NMI, ARI). Across Larochelle-style MNIST variants, Fashion-MNIST, and CIFAR-10/100, MVAE consistently matches or outperforms diagonal-covariance VAEs of comparable capacity, with particularly notable gains in calibration and clustering metrics at both low and high latent dimensions. Qualitative analyses further show smoother, more semantically coherent latent traversals and sharper reconstructions. All code, dataset splits, and evaluation utilities are released to facilitate reproducible comparison and future extensions of multivariate posterior models.
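As an illustration of why this posterior form keeps both the reparameterization and the KL term simple, the following NumPy sketch samples $z = \mu + C(\sigma \odot \varepsilon)$ and evaluates the KL divergence in closed form. It is not the released implementation: it assumes a standard normal prior $\mathcal{N}(0, I)$ and an invertible coupling matrix $C$, neither of which is stated in the abstract, and the function names `sample_posterior` and `kl_to_standard_normal` are hypothetical.

```python
import numpy as np

def sample_posterior(mu, log_var, C, rng):
    """Draw z ~ N(mu, C diag(sigma^2) C^T) by reparameterization.

    mu, log_var: arrays of shape (batch, d), one row per sample.
    C: global (d, d) coupling matrix shared by all samples.
    """
    sigma = np.exp(0.5 * log_var)
    eps = rng.standard_normal(mu.shape)      # eps ~ N(0, I)
    # Row-vector form of z = mu + C (sigma * eps).
    return mu + (sigma * eps) @ C.T

def kl_to_standard_normal(mu, log_var, C):
    """KL( N(mu, C D C^T) || N(0, I) ) per sample, with D = diag(exp(log_var)).

    Assumes a standard normal prior and an invertible C (illustrative choices).
    """
    d = mu.shape[1]
    var = np.exp(log_var)
    # tr(C D C^T) = sum_j var_j * ||C[:, j]||^2
    trace = var @ (C ** 2).sum(axis=0)
    # log det(C D C^T) = 2 log|det C| + sum_j log var_j
    _, logabsdet = np.linalg.slogdet(C)
    logdet = 2.0 * logabsdet + log_var.sum(axis=1)
    return 0.5 * (trace + (mu ** 2).sum(axis=1) - d - logdet)

# Example usage with arbitrary illustrative values.
rng = np.random.default_rng(0)
mu = rng.standard_normal((2, 3))
log_var = 0.1 * rng.standard_normal((2, 3))
C = np.eye(3) + 0.1 * rng.standard_normal((3, 3))
z = sample_posterior(mu, log_var, C, rng)
kl = kl_to_standard_normal(mu, log_var, C)
```

With $C = I$ the expression reduces to the familiar diagonal-VAE KL term, so under these assumptions the coupling matrix adds only a column-norm reweighting of the variances and a single log-determinant that is shared across the batch.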
