On Kernel-based Variational Autoencoder
Tian Qin, Wei-Min Huang
TL;DR
This work addresses the limited expressiveness of Gaussian posteriors in VAEs by introducing a KDE-based posterior, and derives a computable upper bound on the KL term in the ELBO. It proves that the Epanechnikov kernel minimizes that bound asymptotically and implements EVAE using a location-scale reparameterization to sample from the KDE-based posterior. Empirically, EVAE yields improved reconstruction quality and sharper images on MNIST, Fashion-MNIST, CIFAR-10, and CelebA, particularly at higher latent dimensions, while maintaining competitive training times. The approach provides a principled, flexible alternative to Gaussian VAEs and establishes a bridge between KDE theory and variational inference, with potential extensions to tighter KL bounds and different kernel criteria.
Abstract
In this paper, we bridge Variational Autoencoders (VAEs) and kernel density estimation (KDE) by approximating the posterior with a KDE and deriving an upper bound on the Kullback-Leibler (KL) divergence in the evidence lower bound (ELBO). The flexibility of KDEs makes it possible to optimize the posterior in VAEs, which not only addresses the limitations of the Gaussian latent space in the vanilla VAE but also provides a new perspective on estimating the KL divergence in the ELBO. Under appropriate conditions, we show that the Epanechnikov kernel is the asymptotically optimal choice for minimizing the derived upper bound on the KL divergence. Compared with the Gaussian kernel, the Epanechnikov kernel has compact support, which should make the generated samples less noisy and blurry. Implementing the Epanechnikov kernel in the ELBO is straightforward because it belongs to the "location-scale" family of distributions, to which the reparameterization trick applies directly. A series of experiments on benchmark datasets such as MNIST, Fashion-MNIST, CIFAR-10 and CelebA further demonstrates the superiority of the Epanechnikov Variational Autoencoder (EVAE) over the vanilla VAE in the quality of reconstructed images, as measured by FID score and sharpness.
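To illustrate the location-scale reparameterization mentioned in the abstract, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the standard Epanechnikov density K(u) = (3/4)(1 - u^2) on [-1, 1]; the paper may use a different scaling. The function names `sample_epanechnikov` and `reparameterize` are hypothetical. The sampler relies on a known order-statistics fact: the median of three independent Uniform(-1, 1) draws has exactly this density.

```python
import numpy as np


def sample_epanechnikov(shape, rng):
    """Draw noise from the standard Epanechnikov density on [-1, 1].

    Assumption for this sketch: K(u) = 0.75 * (1 - u**2) for |u| <= 1.
    The median of three i.i.d. Uniform(-1, 1) variables has this density.
    """
    u = rng.uniform(-1.0, 1.0, size=(3,) + tuple(shape))
    return np.median(u, axis=0)


def reparameterize(mu, sigma, rng):
    """Location-scale reparameterization: z = mu + sigma * eps.

    The noise eps does not depend on mu or sigma, so z is a
    deterministic, differentiable function of both parameters.
    """
    eps = sample_epanechnikov(np.shape(mu), rng)
    return mu + sigma * eps


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mu = np.array([0.0, 1.0, -2.0])      # hypothetical encoder means
    sigma = np.array([1.0, 0.5, 2.0])    # hypothetical encoder scales
    z = reparameterize(mu, sigma, rng)
    # Compact support: every latent sample lies within mu +/- sigma.
    print(np.all(np.abs(z - mu) <= sigma))
```

Unlike Gaussian noise, each sample here is bounded by `mu - sigma` and `mu + sigma`, which is the compact-support property the abstract refers to. In an autodiff framework the same two-line transformation would let gradients flow to `mu` and `sigma`, since the randomness is isolated in `eps`.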
