Improving Conditional VAE with approximation using Normalizing Flows
Tuhin Subhra De
TL;DR
This work revisits conditional image generation with CVAEs by addressing two key weaknesses: blurry outputs and the difficulty of modeling the conditional latent distribution p(z|y). It introduces an analytically derived optimal decoder variance to reduce blurriness and employs normalizing flows, including affine-coupling based non-volume preserving flows, to flexibly estimate p(z|y). The proposed sigma-VAE with NF demonstrates improved attribute-conditioned generation on CelebA, reflected in better conditional fidelity and competitive likelihood metrics. Overall, the approach shows that classical VAE frameworks can regain competitive performance for conditional generation through principled variance control and expressive density estimation of the latent conditionals.
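To make the idea of an analytically derived decoder variance concrete, the following is a minimal sketch rather than the paper's implementation. It assumes a single scalar variance shared across all pixels of one image; the function names and toy values are hypothetical. Under that assumption, the Gaussian negative log-likelihood is minimized when the variance equals the mean squared reconstruction error.

```python
import math

def gaussian_nll(x, mu, sigma2):
    """Negative log-likelihood of x under N(mu, sigma2 * I), summed over dimensions."""
    d = len(x)
    sq_err = sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
    return 0.5 * d * math.log(2 * math.pi * sigma2) + sq_err / (2 * sigma2)

def optimal_sigma2(x, mu):
    """Closed-form minimizer of gaussian_nll over sigma2 (assumed shared scalar):
    setting the derivative with respect to sigma2 to zero gives the mean squared error."""
    return sum((xi - mi) ** 2 for xi, mi in zip(x, mu)) / len(x)

# Toy "image" and reconstruction (hypothetical values, for illustration only).
x = [0.2, 0.8, 0.5, 0.1]
mu = [0.25, 0.7, 0.55, 0.2]

s2 = optimal_sigma2(x, mu)
nll_at_optimum = gaussian_nll(x, mu, s2)
nll_larger = gaussian_nll(x, mu, 1.5 * s2)   # variance too large
nll_smaller = gaussian_nll(x, mu, 0.5 * s2)  # variance too small
```

In this sketch, moving the variance away from the mean squared error in either direction increases the negative log-likelihood, which is why a fixed unit variance can weight the reconstruction term poorly relative to the KL term.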
Abstract
Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs) remained the state-of-the-art (SOTA) generative models until 2022, when they were superseded by diffusion-based models. Efforts to improve these traditional models have stagnated as a result. In old-school fashion, we explore image generation with conditional Variational Autoencoders (CVAEs) to incorporate desired attributes into the generated images. VAEs are known to produce blurry images with limited diversity; we adopt a method that addresses this issue by treating the variance of the Gaussian decoder as a learnable parameter during training. Previous works on CVAEs assumed that the conditional distribution of the latent space given the labels is equal to the prior distribution, which does not hold in practice. We show that estimating this conditional distribution with normalizing flows yields better image generation than existing methods, reducing the FID by 4% and increasing the log-likelihood by 7.6% relative to the previous approach.
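As an illustration of how a normalizing flow can estimate the density of the latent code given the labels, here is a minimal sketch of a single label-conditioned affine coupling layer. It is not the paper's architecture: the `conditioner` function is a hypothetical toy stand-in for a learned network, and feeding the label y into the conditioner is an assumption about how conditioning enters. The layer maps z to u invertibly, and the density follows from the change-of-variables formula log p(z|y) = log N(u; 0, I) + log|det ∂u/∂z|.

```python
import math

def conditioner(z1, y, n_out):
    """Toy stand-in (hypothetical) for a network producing log-scales s and shifts t
    from the untouched half z1 and the label vector y."""
    h = sum(z1) + sum(y)
    s = [math.tanh(0.3 * h + 0.1 * k) for k in range(n_out)]
    t = [0.5 * h - 0.2 * k for k in range(n_out)]
    return s, t

def coupling_forward(z, y):
    """Map z to u; the first half passes through, the second half is scaled and shifted.
    Returns u and the log-determinant of the Jacobian, which is sum(s)."""
    half = len(z) // 2
    z1, z2 = z[:half], z[half:]
    s, t = conditioner(z1, y, len(z2))
    u2 = [z2i * math.exp(si) + ti for z2i, si, ti in zip(z2, s, t)]
    return z1 + u2, sum(s)

def coupling_inverse(u, y):
    """Exact inverse of coupling_forward: recompute s, t from the untouched half."""
    half = len(u) // 2
    u1, u2 = u[:half], u[half:]
    s, t = conditioner(u1, y, len(u2))
    z2 = [(u2i - ti) * math.exp(-si) for u2i, si, ti in zip(u2, s, t)]
    return u1 + z2

def log_prob(z, y):
    """log p(z | y) under a standard normal base density and one coupling layer."""
    u, log_det = coupling_forward(z, y)
    base = -0.5 * sum(ui ** 2 for ui in u) - 0.5 * len(u) * math.log(2 * math.pi)
    return base + log_det

z = [0.4, -1.2, 0.7, 0.1]   # toy latent code
y = [1.0, 0.0]              # toy attribute labels
u, log_det = coupling_forward(z, y)
z_back = coupling_inverse(u, y)
```

Because the Jacobian of the coupling layer is triangular, its log-determinant is just the sum of the log-scales, and the scale makes the transform non-volume preserving. A practical flow would stack several such layers and alternate which half is transformed.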
