Beyond Vanilla Variational Autoencoders: Detecting Posterior Collapse in Conditional and Hierarchical Variational Autoencoders

Hien Dang, Tho Tran, Tan Nguyen, Nhat Ho

TL;DR

The paper addresses posterior collapse in VAEs and extends theoretical analysis to linear CVAEs and two-layer HVAE architectures. By deriving global minimizers and identifying collapse conditions through SVD-based analyses of cross-term structures, the authors show that input-output correlation and learnable encoder variance are key drivers of collapse, with hyperparameters such as $\beta$ and $\eta_{dec}$ offering mitigation levers. Experiments on MNIST, synthetic data, and nonlinear CVAEs/MHVAE validate the theory and demonstrate practical strategies for reducing collapse, including decorrelation and separate latent mappings. The work broadens the understanding of latent-variable learning in structured VAEs and provides actionable guidelines for designing architectures and training schemes that preserve informative latent representations.

Abstract

The posterior collapse phenomenon in variational autoencoder (VAE), where the variational posterior distribution closely matches the prior distribution, can hinder the quality of the learned latent variables. As a consequence of posterior collapse, the latent variables extracted by the encoder in VAE preserve less information from the input data and thus fail to produce meaningful representations as input to the reconstruction process in the decoder. While this phenomenon has been an actively addressed topic related to VAE performance, the theory for posterior collapse remains underdeveloped, especially beyond the standard VAE. In this work, we advance the theoretical understanding of posterior collapse to two important and prevalent yet less studied classes of VAE: conditional VAE and hierarchical VAE. Specifically, via a non-trivial theoretical analysis of linear conditional VAE and hierarchical VAE with two levels of latent, we prove that the cause of posterior collapses in these models includes the correlation between the input and output of the conditional VAE and the effect of learnable encoder variance in the hierarchical VAE. We empirically validate our theoretical findings for linear conditional and hierarchical VAE and demonstrate that these results are also predictive for non-linear cases with extensive experiments.

Paper Structure (33 sections, 5 theorems, 82 equations, 12 figures, 5 tables)

Key Result

Theorem 1

Let ${\mathbf Z} := \mathbb{E}_{x} (x \tilde{x}^{\top}) = {\mathbf R} \Theta {\mathbf S}$ be the SVD of ${\mathbf Z}$ with singular values $\{ \theta_i \}_{i=1}^{d_0}$ in non-increasing order, and define ${\mathbf V} := {\mathbf W} {\mathbf P}_{A} \Phi^{1/2}$. With unlearnable ${\mathbf \Sigma} = \ldots$ where ${\mathbf T} \in \mathbb{R}^{d_1 \times d_1}$ is an orthonormal matrix that sorts the diagonal …
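The first step of the theorem statement, forming ${\mathbf Z}$ and reading off its singular values, can be illustrated numerically. The sketch below is not the authors' code: it assumes paired samples of $x$ and $\tilde{x}$ are already available as rows of two arrays (how $\tilde{x}$ is built from $x$ is not stated in this excerpt), and the function name `cross_moment_svd` is hypothetical.

```python
import numpy as np

def cross_moment_svd(x, x_tilde):
    """Estimate Z = E[x x_tilde^T] from paired samples and return its SVD.

    x:       array of shape (n, d0), one sample of x per row
    x_tilde: array of shape (n, d),  the matching sample of x_tilde per row
             (assumed to be given; its construction is not shown here)
    """
    n = x.shape[0]
    z = x.T @ x_tilde / n  # empirical cross second moment, shape (d0, d)
    # np.linalg.svd returns singular values sorted in non-increasing order,
    # matching the ordering of the theta_i in the theorem statement.
    r, theta, s_t = np.linalg.svd(z, full_matrices=False)
    return z, r, theta, s_t

# Toy usage on random data (illustrative only, not data from the paper).
rng = np.random.default_rng(0)
x = rng.normal(size=(500, 4))
x_tilde = rng.normal(size=(500, 4))
z, r, theta, s_t = cross_moment_svd(x, x_tilde)
print(theta)
```

Here `s_t` is the right factor exactly as NumPy returns it, so `r @ np.diag(theta) @ s_t` reproduces `z`.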

Figures (12)

  • Figure 1: Graphical illustration of standard VAE, CVAE, and MHVAE with two latents.
  • Figure 2: Graphs of $(\epsilon, \delta)$-collapse with varied hyperparameters ($\delta = 0.05$). (a) For learnable ${\mathbf \Sigma}$, 3 (out of 5) latent dimensions collapse immediately at $\epsilon = 8 \times 10^{-5}$, while collapse does not happen with unlearnable ${\mathbf \Sigma} = {\mathbf I}$. (b) A larger value of $\beta$ or $\eta_{\text{dec}}$ makes more latent dimensions collapse. (c) A larger value of $\beta_2$ or $\eta_{\text{dec}}$ makes more latent dimensions collapse, whereas a larger value of $\beta_1$ mitigates posterior collapse.
  • Figure 3: (a) Graphs of $(\epsilon, \delta)$-collapse for several CVAEs trained separately on each of the three digit subsets $\{1,9,7\}$ of MNIST ($\delta=0.05$). Datasets with smaller $\theta_i$'s ($1 \rightarrow 9 \rightarrow 7$ in increasing order) have more collapsed dimensions. (b) Samples reconstructed by nonlinear MHVAE. Smaller $\beta_2$ alleviates collapse and produces better samples, while smaller $\beta_1$ has the reverse effect.
  • Figure 4: Samples reconstructed by nonlinear MHVAE with different $(\beta_1, \beta_2)$ combinations. Smaller $\beta_2$ alleviates collapse and produces better samples, while smaller $\beta_1$ has the reverse effect.
  • Figure 5: Graph of $(\epsilon, \delta)$-collapse of a ResNet-18 VAE model with learnable ${\mathbf \Sigma}$ and unlearnable ${\mathbf \Sigma} = \mathbf{I}$, $\eta_{\text{enc}} = 1$ ($\delta=0.01$). Learnable ${\mathbf \Sigma}$ suffers posterior collapse, with most of the latent dimensions collapsing to the prior at small $\epsilon$, while unlearnable ${\mathbf \Sigma}$ does not.
  • ...and 7 more figures
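
The $(\epsilon, \delta)$-collapse measure plotted in these figures is not defined in this excerpt. The sketch below assumes the definition commonly used in the linear-VAE literature: latent dimension $i$ counts as collapsed when the per-sample KL divergence between the encoder posterior and a standard normal prior falls below $\epsilon$ for at least a $1 - \delta$ fraction of the data. It also assumes a diagonal Gaussian encoder; the function names and toy numbers are hypothetical.

```python
import math

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, sigma^2) || N(0, 1) ) for one scalar latent dimension."""
    return 0.5 * (sigma ** 2 + mu ** 2 - 1.0 - 2.0 * math.log(sigma))

def collapsed_dims(mus, sigmas, eps, delta):
    """Flag each latent dimension as (eps, delta)-collapsed or not.

    mus[n][i], sigmas[n][i]: encoder mean and standard deviation for
    data point n and latent dimension i (assumed diagonal Gaussian posterior).
    """
    n_samples = len(mus)
    n_dims = len(mus[0])
    flags = []
    for i in range(n_dims):
        # Count the data points whose posterior for dim i is within eps of the prior.
        below = sum(
            1
            for n in range(n_samples)
            if kl_to_standard_normal(mus[n][i], sigmas[n][i]) < eps
        )
        flags.append(below / n_samples >= 1.0 - delta)
    return flags

# Toy usage: dimension 0 matches the prior exactly, dimension 1 does not.
mus = [[0.0, 2.0], [0.0, -2.0], [0.0, 1.5]]
sigmas = [[1.0, 0.5], [1.0, 0.5], [1.0, 0.5]]
print(collapsed_dims(mus, sigmas, eps=1e-3, delta=0.05))  # [True, False]
```

Sweeping `eps` over a grid and plotting the fraction of flagged dimensions gives a curve of the kind the captions describe.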

Theorems & Definitions (15)

  • Theorem 1: Unlearnable ${\mathbf \Sigma}$
  • Theorem 2: Learnable ${\mathbf \Sigma}$
  • Theorem 3: Unlearnable isotropic ${\mathbf \Sigma}_1$, Learnable ${\mathbf \Sigma}_2$
  • Remark 1
  • Proof: Proof of Theorem \ref{thm:VAE_fixed_sigma}
  • Remark 2
  • Theorem 4: Learnable ${\mathbf \Sigma}$
  • Proof: Proof of Theorem \ref{thm:VAE_learnable_sigma}
  • Proof: Proof of Theorem \ref{thm:1}
  • Remark 3
  • ...and 5 more