Preventing Posterior Collapse with delta-VAEs

Ali Razavi, Aäron van den Oord, Ben Poole, Oriol Vinyals

TL;DR

This paper tackles posterior collapse in VAEs by enforcing a minimum information rate δ between the posterior and prior, without altering the ELBO objective. It introduces δ-VAEs with sequential latent variables, using an AR(1) prior and an anti-causal encoder so that latents capture useful, future-relevant information, and derives a KL lower bound that guarantees a committed rate. The approach yields state-of-the-art or competitive density modeling on CIFAR-10 and ImageNet 32×32 while learning informative latent representations, and demonstrates effective latent usage in text (LM1B) with Transformer decoders. Overall, δ-VAE provides a practical and principled solution to posterior collapse, allowing powerful decoders to be combined with meaningful latent representations.

Abstract

Due to the phenomenon of "posterior collapse," current latent variable generative models pose a challenging design choice that either weakens the capacity of the decoder or requires augmenting the objective so it does not only maximize the likelihood of the data. In this paper, we propose an alternative that utilizes the most powerful generative models as decoders, while optimising the variational lower bound and ensuring that the latent variables preserve and encode useful information. Our proposed $\delta$-VAEs achieve this by constraining the variational family for the posterior to have a minimum distance to the prior. For sequential latent variable models, our approach resembles the classic representation learning approach of slow feature analysis. We demonstrate the efficacy of our approach at modeling text on LM1B and modeling images: learning representations, improving sample quality, and achieving state-of-the-art log-likelihood on CIFAR-10 and ImageNet $32\times 32$.
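
The constraint described in the abstract can be written compactly. The block below is a sketch in standard VAE notation, not a quotation from the paper: $q_\phi(\mathbf{z}|\mathbf{x})$ denotes the approximate posterior, $p_\theta(\mathbf{z})$ the prior, $p_\theta(\mathbf{x}|\mathbf{z})$ the decoder, and $\delta$ the committed rate. The symbols $\phi$ and $\theta$ and the exact form of the minimization are assumptions chosen for illustration. The first line is the usual variational lower bound, which is left unchanged; the second line states that the posterior and prior families are chosen so that their KL divergence cannot fall below $\delta$.

```latex
\begin{align}
\log p_\theta(\mathbf{x})
  &\ge \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\!\left[\log p_\theta(\mathbf{x}|\mathbf{z})\right]
     - D_{\mathrm{KL}}\!\left(q_\phi(\mathbf{z}|\mathbf{x}) \,\Vert\, p_\theta(\mathbf{z})\right), \\
\min_{\phi,\,\theta}\; D_{\mathrm{KL}}\!\left(q_\phi(\mathbf{z}|\mathbf{x}) \,\Vert\, p_\theta(\mathbf{z})\right)
  &\ge \delta > 0 .
\end{align}
```

Because the KL term is bounded away from zero for every parameter setting, the optimizer cannot collapse the posterior onto the prior, regardless of how powerful the decoder is.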

Paper Structure

This paper contains 24 sections, 14 equations, 14 figures, 6 tables.

Figures (14)

  • Figure 1: Effect of $\delta$ in a toy model. Fitting an uncorrelated Gaussian for the posterior, $q_\phi(z)$, to a correlated Gaussian prior, $p_\alpha(z)$, by minimizing $D_{\mathrm{KL}}(q_\phi(z) \Vert p_\alpha(z))$ over $\phi$. Left: committed rate ($\delta$) as a function of the prior squared correlation $\alpha$ and the dimensionality $n$. Right: contours of the optimal posterior and prior in 2d. As the correlation increases, the minimum rate grows.
  • Figure 2: Generative structures for the inference of sequential latent variables. The anti-causal structure introduces an inductive bias to encode in each latent variable information about the future.
  • Figure 3: Random samples from our ImageNet $32\times 32$ model. In one panel, each column shows multiple samples from $p({\mathbf{x}}|{\mathbf{z}})$ for a fixed ${\mathbf{z}} \sim p_{aux}({\mathbf{z}})$. In the other panel, each image is decoded using a different sample from $p_{aux}({\mathbf{z}})$.
  • Figure 4: Comparison of CIFAR-10 test performance of $\delta$-VAEs vs. models trained with free-bits and $\beta$-VAE across many rates. $\delta$-VAE is significantly more stable, achieves competitive density estimation results across different rates, and its learned representations perform better in the downstream linear classification task.
  • Figure 5: Architecture for images
  • ...and 9 more figures
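
The toy setting in the Figure 1 caption (an uncorrelated Gaussian posterior fitted to a correlated Gaussian prior) can be reproduced numerically. The sketch below is an illustration under stated assumptions, not the authors' code: it assumes a zero-mean AR(1) prior with unit marginal variance and coefficient `a`, so that the covariance is `a ** |i - j|`, and a zero-mean diagonal Gaussian posterior. The function names are hypothetical, and the mapping between `a` and the caption's $\alpha$ is not asserted here. For a diagonal posterior, the KL-minimizing variances are the reciprocals of the diagonal of the prior precision matrix, which gives the minimum rate directly.

```python
import numpy as np


def ar1_covariance(n, a):
    """Covariance of a stationary AR(1) Gaussian with unit marginal variance (assumed form)."""
    idx = np.arange(n)
    return float(a) ** np.abs(idx[:, None] - idx[None, :])


def min_kl_diag_posterior(cov):
    """Smallest KL(q || p) over zero-mean diagonal Gaussians q, for p = N(0, cov).

    Setting the gradient w.r.t. each posterior variance to zero gives
    var_i = 1 / precision_ii, and the KL then reduces to
    0.5 * (log det(cov) + sum_i log precision_ii).
    """
    precision = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (logdet + np.sum(np.log(np.diag(precision))))


def closed_form_rate(n, a):
    """Closed form of the same quantity for the AR(1) covariance above (n >= 2)."""
    return 0.5 * ((n - 2) * np.log1p(a ** 2) - np.log1p(-(a ** 2)))


if __name__ == "__main__":
    for n in (2, 10, 50):
        for a in (0.0, 0.5, 0.95):
            numeric = min_kl_diag_posterior(ar1_covariance(n, a))
            print(f"n={n:3d} a={a:4.2f} min KL={numeric:.4f} "
                  f"closed form={closed_form_rate(n, a):.4f}")
```

With `a = 0` the prior is itself uncorrelated and the minimum rate is zero; as `a` or `n` grows, the best diagonal posterior can no longer match the prior and the minimum KL rises, which is the qualitative behaviour the caption describes.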