Improving Conditional VAE with approximation using Normalizing Flows

Tuhin Subhra De

TL;DR

This work revisits conditional image generation with CVAEs by addressing two key weaknesses: blurry outputs and the difficulty of modeling the conditional latent distribution p(z|y). It introduces an analytically derived optimal decoder variance to reduce blurriness and employs normalizing flows, including affine-coupling based non-volume preserving flows, to flexibly estimate p(z|y). The proposed sigma-VAE with NF demonstrates improved attribute-conditioned generation on CelebA, reflected in better conditional fidelity and competitive likelihood metrics. Overall, the approach shows that classical VAE frameworks can regain competitive performance for conditional generation through principled variance control and expressive density estimation of the latent conditionals.
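
As a rough illustration of what an analytically optimal decoder variance can look like, the sketch below assumes a Gaussian decoder with a single scalar standard deviation shared across all pixels. Under that assumption the likelihood-maximizing variance is the mean squared reconstruction error. The function names `gaussian_nll` and `optimal_sigma` and the toy data are hypothetical, and the exact formulation used in the paper may differ.

```python
import numpy as np

def gaussian_nll(x, mu, sigma):
    """Negative log-likelihood of x under N(mu, sigma^2 I), summed over all elements."""
    d = x.size
    sq_err = np.sum((x - mu) ** 2)
    return 0.5 * sq_err / sigma**2 + d * np.log(sigma) + 0.5 * d * np.log(2 * np.pi)

def optimal_sigma(x, mu):
    """Closed-form minimizer of gaussian_nll over a shared scalar sigma (assumption:
    one sigma for the whole batch). Setting the derivative to zero gives sigma^2 = MSE."""
    return np.sqrt(np.mean((x - mu) ** 2))

# Toy stand-ins for an image batch and its reconstruction (not real model outputs).
rng = np.random.default_rng(0)
x = rng.random((4, 8, 8))
mu = x + 0.1 * rng.standard_normal(x.shape)

sigma_star = optimal_sigma(x, mu)
nll_star = gaussian_nll(x, mu, sigma_star)
nll_unit = gaussian_nll(x, mu, 1.0)  # the common fixed choice sigma = 1
```

With a fixed sigma of 1 the reconstruction term reduces to a plain squared error, whereas a calibrated sigma rescales that term relative to the KL term of the VAE objective.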

Abstract

Variational Autoencoders and Generative Adversarial Networks remained the state-of-the-art (SOTA) generative models until 2022. They have since been superseded by diffusion-based models, and efforts to improve the traditional models have stagnated as a result. In old-school fashion, we explore image generation with conditional Variational Autoencoders (CVAEs) to incorporate desired attributes within the images. VAEs are known to produce blurry images with limited diversity; we refer to a method that solves this issue by treating the variance of the Gaussian decoder as a learnable parameter during training. Previous works on CVAEs assumed that the conditional distribution of the latent space given the labels is equal to the prior distribution, which is not the case in reality. We show that estimating this conditional distribution with normalizing flows yields better image generation than existing methods, reducing the FID by 4% and increasing the log likelihood by 7.6% compared with the previous case.
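
To make the idea of a label-conditioned normalizing flow more concrete, here is a minimal sketch of a single affine coupling layer whose scale and shift depend on both half of the latent vector and the label y. Everything in it is an assumption for illustration: the class name `ConditionalAffineCoupling`, the linear scale/shift maps with random weights, and conditioning by concatenating y. The paper's actual architecture and flow direction are not specified in this section.

```python
import numpy as np

class ConditionalAffineCoupling:
    """One affine coupling layer conditioned on a label vector y (illustrative only)."""

    def __init__(self, dim, cond_dim, rng):
        self.d = dim // 2                      # size of the pass-through half
        in_dim = self.d + cond_dim
        out_dim = dim - self.d
        # Hypothetical linear "networks" for scale and shift; a real model would train these.
        self.W_s = 0.1 * rng.standard_normal((in_dim, out_dim))
        self.W_t = 0.1 * rng.standard_normal((in_dim, out_dim))

    def _scale_shift(self, z_a, y):
        h = np.concatenate([z_a, y], axis=1)
        s = np.tanh(h @ self.W_s)              # bounded log-scale for numerical stability
        t = h @ self.W_t
        return s, t

    def forward(self, z, y):
        z_a, z_b = z[:, :self.d], z[:, self.d:]
        s, t = self._scale_shift(z_a, y)
        u = np.concatenate([z_a, z_b * np.exp(s) + t], axis=1)
        log_det = s.sum(axis=1)                # log |det Jacobian| of the transform
        return u, log_det

    def inverse(self, u, y):
        u_a, u_b = u[:, :self.d], u[:, self.d:]
        s, t = self._scale_shift(u_a, y)       # u_a equals z_a, so s and t are recoverable
        return np.concatenate([u_a, (u_b - t) * np.exp(-s)], axis=1)

rng = np.random.default_rng(0)
layer = ConditionalAffineCoupling(dim=6, cond_dim=3, rng=rng)
z = rng.standard_normal((5, 6))                # toy latent codes
y = rng.integers(0, 2, size=(5, 3)).astype(float)  # toy binary attributes
u, log_det = layer.forward(z, y)
z_rec = layer.inverse(u, y)
```

Because half of the vector passes through unchanged, the Jacobian is triangular and its log-determinant is just the sum of the log-scales, which is what makes the density of p(z|y) tractable under the change-of-variables formula.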

Paper Structure

This paper contains 20 sections, 51 equations, 5 figures, 1 table.

Figures (5)

  • Figure 1: Graphical model for VAE and CVAE. The model at the center has a latent space that is independent of the labels, whereas the model on the right has a latent space that depends on them. X is the input, y is the label/attribute, and X' is the reconstructed input.
  • Figure 2: Flow of the $\sigma$-CVAE (NF) model during training. The blocks in yellow are trainable. The blocks in red depict the loss functions.
  • Figure 3: Reconstructions of test-set images by the models after training. The top row of each section, marked with a green border, shows the original images.
  • Figure 4: Flow of the $\sigma$-CVAE (NF) model during inference, i.e. when sampling a random image with some labels.
  • Figure 5: Comparison between the random images generated by the models under the described scenarios. The text box on the left contains the attributes on which the images were conditioned. The attributes in the bolder dark maroon font are displayed prominently by NF-CVAE. The last row, marked with a red border, shows generation from attribute combinations that might not be present in real life or during training.