
The ELBO of Variational Autoencoders Converges to a Sum of Three Entropies

Simon Damm, Dennis Forster, Dmytro Velychko, Zhenwen Dai, Asja Fischer, Jörg Lücke

TL;DR

This work proves that for standard Gaussian VAEs, the ELBO at any stationary point equals a sum of three entropies: the average encoder entropy minus the prior entropy and the expected decoder entropy. The bound can therefore often be computed in closed form from the encoder and decoder variances. The paper introduces a reparameterized VAE (VAE-2) with a learnable prior covariance to connect to the classic VAE-1, and extends the result to general Gaussian VAEs (including VAE-3 with latent-dependent decoder covariance). The authors validate the theory with experiments across linear, nonlinear, and complex VAEs on diverse data, showing that the entropy sum tracks the ELBO with high accuracy near convergence and offering entropy-based tools for ELBO estimation, model selection, and posterior collapse analysis. The entropy perspective provides a principled framework to interpret VAE learning dynamics, connect optimization to the volumes of typical sets, and enable practical methods for monitoring and selecting models in streaming and large-scale settings.

Abstract

The central objective function of a variational autoencoder (VAE) is its variational lower bound (the ELBO). Here we show that for standard (i.e., Gaussian) VAEs the ELBO converges to a value given by the sum of three entropies: the (negative) entropy of the prior distribution, the expected (negative) entropy of the observable distribution, and the average entropy of the variational distributions (the latter is already part of the ELBO). Our derived analytical results are exact and apply for small as well as for intricate deep networks for encoder and decoder. Furthermore, they apply for finitely and infinitely many data points and at any stationary point (including local maxima and saddle points). The result implies that the ELBO can for standard VAEs often be computed in closed-form at stationary points while the original ELBO requires numerical approximations of integrals. As a main contribution, we provide the proof that the ELBO for VAEs is at stationary points equal to entropy sums. Numerical experiments then show that the obtained analytical results are sufficiently precise also in those vicinities of stationary points that are reached in practice. Furthermore, we discuss how the novel entropy form of the ELBO can be used to analyze and understand learning behavior. More generally, we believe that our contributions can be useful for future theoretical and practical studies on VAE learning as they provide novel information on those points in parameters space that optimization of VAEs converges to.
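As a rough illustration of the closed-form claim, the sketch below evaluates the three-entropy sum for a VAE-1-style model. It assumes a standard normal prior, a single scalar decoder variance shared across all observed dimensions, and diagonal encoder variances. The function names and argument layout are hypothetical and not taken from the paper. Only the standard entropy formula for a diagonal Gaussian, $\frac{1}{2}\sum_i \log(2\pi e\, v_i)$, is used, and the value matches the ELBO only at stationary points.

```python
import numpy as np


def gaussian_entropy(variances):
    """Entropy (in nats) of a diagonal Gaussian, summed over the last axis."""
    variances = np.asarray(variances, dtype=float)
    return 0.5 * np.sum(np.log(2.0 * np.pi * np.e * variances), axis=-1)


def three_entropies(encoder_vars, decoder_var, data_dim):
    """Average encoder entropy minus prior entropy minus decoder entropy.

    encoder_vars: array of shape (N, H), diagonal encoder variances per data point
    decoder_var:  scalar decoder variance sigma^2 (assumed shared by all dimensions)
    data_dim:     number of observed dimensions D
    """
    encoder_vars = np.asarray(encoder_vars, dtype=float)
    n_latents = encoder_vars.shape[1]

    # Average entropy of the variational distributions over the data set.
    avg_encoder_entropy = float(np.mean(gaussian_entropy(encoder_vars)))
    # Entropy of the prior, assumed here to be a standard normal in H dimensions.
    prior_entropy = float(gaussian_entropy(np.ones(n_latents)))
    # Entropy of the decoder Gaussian with covariance sigma^2 * I in D dimensions.
    decoder_entropy = float(gaussian_entropy(np.full(data_dim, decoder_var)))

    return avg_encoder_entropy - prior_entropy - decoder_entropy


# Hypothetical example values: 4 data points, 3 latents, 5 observed dimensions.
example = three_entropies(np.full((4, 3), 0.5), decoder_var=0.1, data_dim=5)
```

Because the expression depends only on variances, it needs no sampling from the encoder, in contrast to a Monte Carlo estimate of the ELBO.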

Paper Structure

This paper contains 37 sections, 4 theorems, 68 equations, 10 figures.

Key Result

Theorem 1

Given a VAE-2 as in Definition 2 that satisfies the linear-mapping assumption (assumption:linear_mapping), the ELBO ${\mathcal{F}}(\Phi,\Theta)$ of VAE-2 is at all stationary points equal to

Figures (10)

  • Figure 1: Verification of the entropy results on VAE models of increasing complexity on different data sets. Top plots: Absolute values of the given bounds per data point of single runs. In (a) the ELBO is essentially equal to the log-likelihood (zoom in to see). In (c) both quantities are displayed for three different seeds. Bottom plots: Median and interquartile range of the relative difference between the ELBO and the sum of three entropies over multiple runs ($10$ for (a) and (b), $100$ for (c)). See \ref{fig:Verification_App} for further architectures and data sets, and \ref{app:Experiments} for details on the experiments.
  • Figure 2: ELBO estimation for the VAE-1 model on CelebA. (a) ELBO and the sum of the three entropies during training. The close-up reveals that the ELBO of mini-batches fluctuates around the sum of entropies. (b) Direct approximation of the ELBO and the three entropies for the trained model with different sample sizes, repeated 10 times. Note that the standard deviation is shown on a logarithmic scale.
  • Figure 3: (a) Streaming VAE application. Three linear VAEs trained on streaming data with changing dimensionality. The BIC score based on the online ELBO estimate (transparent plots) is very noisy. The three-entropies BIC score (smooth solid plots) is an estimator that depends only on the model parameters. Note that it allows for easier and more stable model selection (the lower the BIC score, the better the model). (b) Posterior collapse monitoring for VAE-1 on SUSY. Latent variables are collapsed if $\frac{1}{N}\sum_{n=1}^N {\mathcal{H}}[q_{\Phi}(z_h\vert\mathbf{x}^{\space(n)})] > 1$ (\ref{EqnPostCollCriterion}). See \ref{app:PosteriorCollapse} and \ref{app:Experiments} for details on posterior collapse and experimental set-up.
  • Figure 4: Visualization of the variational lower bound and its relation to the three entropies expression. The figure is a two-dimensional visualization whose x-axis represents the hyperplane of variance parameters $\alpha^2_1$ to $\alpha^2_H$ together with the decoder variance $\sigma^2$. We assume a VAE of type VAE-1 for the figure and take $\alpha^2_1$ to $\alpha^2_H$ to be implicitly defined by the decoder weights (compare VAE-2). The dotted black line represents a submanifold in which the parameters of the x-axis have converged. Within the submanifold, the variational lower bound is equal to a sum of three entropies (\ref{theo:VAE-1}). In the illustrated example, the submanifold connects a stationary point without posterior collapse to a stationary point with (partial) posterior collapse. Depending on the location on the manifold, learning is dominated by the change in the reconstruction score $S_{\mathrm{rec}}(\Theta)$ (green arrow) or by the change in the regularization score $S_{\mathrm{reg}}(\Phi,\Theta)$ (red arrow), with qualitatively different outcomes. As both scores can be defined based on entropies (see \ref{eq:Scores_Reg_Rec}), changes of the scores directly translate to changes in entropies, so that the optimization landscape can be characterized using changes in entropies. Let us denote by $|\Delta{}{\mathcal{H}}[p_{\Theta}^{\mathrm{dec}}]|$ the absolute change of the decoder entropy and by $|\Delta{}\bar{{\mathcal{H}}}[q_{\Phi}^{\mathrm{enc}}]|$ the absolute change of the average encoder entropy, where $\bar{{\mathcal{H}}}[q_{\Phi}^{\mathrm{enc}}]=\frac{1}{N}\sum_n{}{\mathcal{H}}[q_{\Phi}^{\mathrm{enc}}(\mathbf{z}\vert\mathbf{x}^{\space(n)})]$. If we start at the high local optimum and traverse the submanifold from left to right, then the reconstruction score $S_{\mathrm{rec}}(\Theta)$ decreases while the regularization score $S_{\mathrm{reg}}(\Phi,\Theta)$ tends to increase (to a lesser extent). Translated to entropies, we have within the submanifold between the high maximum and the saddle point $|\Delta{}\bar{{\mathcal{H}}}[q_{\Phi}^{\mathrm{enc}}]|<|\Delta{}{\mathcal{H}}[p_{\Theta}^{\mathrm{dec}}]|$, and ELBO optimization would favor reconstruction improvements. At the saddle point (the turning point), the changes of the entropies become equal: $|\Delta{}\bar{{\mathcal{H}}}[q_{\Phi}^{\mathrm{enc}}]|=|\Delta{}{\mathcal{H}}[p_{\Theta}^{\mathrm{dec}}]|$. To the right of the turning point, learning is dominated by the regularization term, which means that the change in encoder entropy is dominant: $|\Delta{}\bar{{\mathcal{H}}}[q_{\Phi}^{\mathrm{enc}}]|>|\Delta{}{\mathcal{H}}[p_{\Theta}^{\mathrm{dec}}]|$. ELBO optimization in this part of the manifold would result in (partial) posterior collapse. At the local optimum with (partly) collapsed posterior, the encoder entropy $\bar{{\mathcal{H}}}_h[q_{\Phi}^{\mathrm{enc}}]$ of a collapsed latent $h$ will be equal to the prior entropy.
  • Figure 5: Additional (verification) experiments on artificial manifold data, SUSY and CelebA with VAE-1 models. Depicted are the lower bound (ELBO), the lower bound on held-out test data, and the sum of entropies. The lower plot depicts the relative difference, with an exponential moving average in dark blue. We remark that the ELBO fluctuates around the three entropies expression. See \ref{app:ExpSpecifications} for further details.
  • ...and 5 more figures
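
The posterior collapse criterion quoted in the Figure 3 caption (average per-latent encoder entropy above 1) can be checked directly from the encoder variances. The sketch below assumes diagonal Gaussian encoders and entropies measured in nats. The function name and the example variances are hypothetical. For reference, a one-dimensional standard normal has entropy $\frac{1}{2}\log(2\pi e)\approx 1.42$ nats, which is the value a fully collapsed latent would take under a standard normal prior.

```python
import numpy as np


def collapsed_latents(encoder_vars, threshold=1.0):
    """Flag latents whose average encoder entropy exceeds the threshold.

    encoder_vars: array of shape (N, H) with the encoder variance of each
                  latent h for each data point n (diagonal Gaussian assumed)
    threshold:    entropy threshold in nats (1, as in the Figure 3 caption)
    """
    encoder_vars = np.asarray(encoder_vars, dtype=float)
    # Entropy of each one-dimensional Gaussian q(z_h | x^(n)).
    per_point_entropy = 0.5 * np.log(2.0 * np.pi * np.e * encoder_vars)
    # Average over the N data points, giving one value per latent.
    avg_entropy = per_point_entropy.mean(axis=0)
    return avg_entropy, avg_entropy > threshold


# Hypothetical variances: latent 0 stays at the prior variance, latent 1 is informative.
avg_entropy, is_collapsed = collapsed_latents([[1.0, 0.01], [1.0, 0.01]])
```

Since the check uses only encoder outputs, it could be run on any batch during training without evaluating the decoder.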

Theorems & Definitions (11)

  • Definition 1: VAE-1; VAE with component-wise equivalent decoder variances
  • Definition 2: VAE-2; VAE with component-wise equivalent decoder variances and learnable prior covariance
  • Theorem 1
  • proof
  • Theorem 2
  • proof
  • Corollary 1
  • proof
  • Definition 3: VAE-3; VAE with latent-dependent diagonal decoder covariance
  • Theorem 3
  • ...and 1 more