Taming VAEs

Danilo Jimenez Rezende, Fabio Viola

TL;DR

This work addresses the instability of training high-capacity VAEs by introducing GECO, a constrained optimization framework that uses an augmented Lagrangian to enforce interpretable reconstruction constraints. The authors provide a theoretical treatment of constrained VAEs, connect β-VAEs to spectral clustering and phase transitions, and show that GECO yields robust, constraint-driven control over reconstruction versus compression. Empirically, GECO improves latent-space coverage (lower marginal KL) and maintains reconstruction quality across large-scale models and datasets without extensive hyperparameter sweeps. The approach offers a practical workflow for tuning VAEs in a data-space–oriented manner with broad applicability to complex generative models.

Abstract

In spite of remarkable progress in deep latent variable generative modeling, training remains a challenge due to a combination of optimization and generalization issues. In practice, a combination of heuristic algorithms (such as hand-crafted annealing of KL terms) is often used to achieve the desired results, but such solutions are not robust to changes in model architecture or dataset. The best settings can vary dramatically from one problem to another, which requires expensive parameter sweeps for each new case. Here we build on the idea of training VAEs with additional constraints as a way to control their behavior. We first present a detailed theoretical analysis of constrained VAEs, expanding our understanding of how these models work. We then introduce and analyze a practical algorithm termed Generalized ELBO with Constrained Optimization, GECO. The main advantage of GECO for the machine learning practitioner is a more intuitive, yet principled, process of tuning the loss. This involves defining a set of constraints, which typically have an explicit relation to the desired model performance, in contrast to tweaking abstract hyper-parameters which implicitly affect the model behavior. Encouraging experimental results on several standard datasets indicate that GECO is a very robust and effective tool to balance reconstruction and compression constraints.
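As an illustration of what "tuning a constraint instead of a hyper-parameter" can look like, the following is a minimal toy sketch, not the authors' implementation. It assumes a scalar parameter `theta`, a stand-in compression cost `theta**2`, a stand-in reconstruction error `(1 - theta)**2`, a multiplicative update of the Lagrange multiplier driven by a moving average of the constraint, and hypothetical step sizes; none of these specifics are given in the abstract.

```python
import math


def geco_toy(kappa=0.25, steps=5000, lr_theta=0.05, lr_lambda=0.05, alpha=0.9):
    """Toy constrained loop: minimise kl(theta) subject to re(theta) <= kappa.

    All functional forms and step sizes are illustrative assumptions.
    """
    theta, lam = 0.0, 1.0   # model parameter and Lagrange multiplier
    c_ma = None             # moving average of the constraint value
    for _ in range(steps):
        re = (1.0 - theta) ** 2   # stand-in for the reconstruction error
        c = re - kappa            # constraint value; satisfied when c <= 0
        # Exponential moving average, initialised with the first value.
        c_ma = c if c_ma is None else alpha * c_ma + (1.0 - alpha) * c
        # Gradient of theta**2 + lam * c with respect to theta.
        grad = 2.0 * theta - lam * 2.0 * (1.0 - theta)
        theta -= lr_theta * grad
        # Multiplicative update keeps lam positive: it grows while the
        # constraint is violated and shrinks once it is met.
        lam *= math.exp(lr_lambda * c_ma)
    return theta, lam, (1.0 - theta) ** 2


theta, lam, re = geco_toy()
print(f"theta={theta:.3f} lambda={lam:.3f} reconstruction_error={re:.3f}")
```

In this toy problem the only quantity the user chooses is the tolerance `kappa`, which lives in the same units as the reconstruction error. The multiplier settles on its own, and its inverse plays the role of an equivalent $\beta$.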

Paper Structure

This paper contains 24 sections, 6 theorems, 13 equations, 11 figures, 2 tables, 1 algorithm.

Key Result

Proposition 1

(Fixed-point equations) The extrema of the ELBO $\mathcal{F}$ with respect to the decoder $g({\mathbf{z}})$ and encoder $q({\mathbf{z}} | {\mathbf{x}})$ are solutions of the fixed-point equations eq.fp.q.elbo and eq.fp.g.elbo, and, for the Lagrangian $\mathcal{L}_{\boldsymbol{\lambda}}$, of eq.fp.q and eq.fp.g respectively, where $H({\mathbf{x}}, {\mathbf{z}}) = \boldsymbol{\lambda}^T\mathcal{C}( {\mathbf{x}}, g^{t-1}_{\m
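For orientation, the Lagrangian $\mathcal{L}_{\boldsymbol{\lambda}}$ mentioned in the proposition can be sketched as follows. This is a hedged reconstruction of the usual constrained-VAE setup rather than a quotation: the data density $\rho({\mathbf{x}})$ and the prior $\pi({\mathbf{z}})$ are symbols assumed here, since they do not appear in the excerpt above.

```latex
% Constrained problem (assumed form): compress as much as possible
% while satisfying the reconstruction constraints in expectation.
\min_{q,\, g}\; \mathbb{E}_{\rho(\mathbf{x})}\!\left[
  \mathrm{KL}\!\left[ q(\mathbf{z} \mid \mathbf{x}) \,\|\, \pi(\mathbf{z}) \right]
\right]
\quad \text{s.t.} \quad
\mathbb{E}_{\rho(\mathbf{x})\, q(\mathbf{z} \mid \mathbf{x})}\!\left[
  \mathcal{C}(\mathbf{x}, g(\mathbf{z}))
\right] \le 0

% Corresponding Lagrangian with multipliers \lambda \ge 0.
\mathcal{L}_{\boldsymbol{\lambda}} =
\mathbb{E}_{\rho(\mathbf{x})}\!\left[
  \mathrm{KL}\!\left[ q(\mathbf{z} \mid \mathbf{x}) \,\|\, \pi(\mathbf{z}) \right]
\right]
+ \boldsymbol{\lambda}^{T}\,
\mathbb{E}_{\rho(\mathbf{x})\, q(\mathbf{z} \mid \mathbf{x})}\!\left[
  \mathcal{C}(\mathbf{x}, g(\mathbf{z}))
\right]
```

Under this reading, a single reconstruction constraint of the form $\mathcal{C} = \|{\mathbf{x}} - g({\mathbf{z}})\|^2 - \kappa^2$ gives, up to a constant, a β-VAE-like objective with $\beta = 1/\lambda$.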

Figures (11)

  • Figure 1: Illustration of the "blurred reconstructions" and the "holes" problems. Left: Latent space with a posterior with support in a tiling $\{\Omega_i\}$, where each tile $\Omega_i$ represents the support of the posterior for the data-point $x_i$. Right: Data space. In the region of the latent space where the posteriors of the data-points $x_1$ and $x_2$ overlap, $\Omega_1 \cap \Omega_2$, the optimal reconstruction $\hat{x}$ is a weighted average of the corresponding data-points, resulting in a blurred sample. In a region of low density under the marginal posterior, a "hole" (represented by the black area in the figure), the optimal reconstructions $\tilde{x}$ are unconstrained by the ELBO objective function.
  • Figure 2: Effect of $\beta$ on the reconstruction fixed-points and phase-transitions. Top images: Grey curves indicate the trajectories of the vectors $\psi$. Red points are the fixed-points. Blue points are the data points. Bottom left: Expected reconstruction error as a function of $\beta$. Vertical grey lines indicate the detected critical temperatures $\beta_c^k$. Bottom right: Expected second-order derivative of the reconstruction error as a function of $\beta$. At critical temperatures, reconstruction fixed-points merge with each other, resulting in sudden changes in the slope of the reconstruction error with respect to the temperature $\beta$; these points correspond to spikes in the second-order derivatives. For analysis, we sorted the local maxima according to their height and restricted the analysis to the top-3 points, $\beta_c^{k=1,2,3}$. Details of this simulation are explained in sec.experiments.fixedpoints.
  • Figure 3: Trajectory in the information plane induced by GECO during training. This plot shows a typical trajectory in the NLL/KL plane for a model trained using GECO with an RE constraint, alongside the corresponding values of the equivalent $\beta$ and pixel reconstruction errors; note that iteration information is consistently encoded using color in the three plots. At the beginning of training, $it < 10^4$, the reconstruction constraint dominates optimization, with $\beta < 1$ implicitly amplifying the NLL term in the ELBO. When the inequality constraint is met, i.e. the reconstruction error curve crosses the horizontal $\kappa$ threshold line, $\beta$ slowly starts changing, modulated by the moving average, until, at $it = 10^4$, the $\beta$ curve flexes and $\beta$ starts growing. This specific example is for a conditional ConvDraw model trained on MNIST-rotate.
  • Figure 4: Information plane analysis of Conditional ConvDraw, ConvDraw and VAE+NVP with and without RE constraints. Each plot shows the final reconstruction / compression trade-off achieved during training for the same ConvDraw and VAE+NVP models using ELBO, GECO and ELBO with a hand-annealed $\beta$, respectively. For GECO we report results for the following reconstruction thresholds $\kappa \in \{0.06, 0.08, 0.1, 0.125, 0.175\}$, and visually tie them together by connecting them via a line colour-coded by the dataset instance they refer to. For the hand-annealed $\beta$ we use the same annealing scheme reported in GQN. Results are shown for a variety of conditional and unconditional datasets, providing evidence of the consistency of the behavior of GECO across different domains.
  • Figure 5: GECO results in lower average KL at fixed reconstruction error compared to ELBO. We first trained an expressive ConvDRAW model on CIFAR10 using the standard ELBO objective until convergence and recorded its reconstruction error (MSE=0.00029). At this reconstruction error value, the reconstructions are visually perfect. We then trained the same model architecture using GECO with an RE constraint set up to achieve the same reconstruction error. The curves for the model trained with ELBO (green) and with GECO (blue) demonstrate that we can achieve the same reconstruction error but with a lower average KL between prior and posterior.
  • ...and 6 more figures

Theorems & Definitions (12)

  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Proposition 4
  • Proposition 5
  • Proposition 6
  • proof
  • proof
  • proof
  • proof
  • ...and 2 more