Theoretical Convergence Guarantees for Variational Autoencoders

Sobihan Surendran, Antoine Godichon-Baggioni, Sylvain Le Corff

TL;DR

This work delivers non-asymptotic convergence guarantees for training Variational Autoencoders with SGD and Adam, establishing a rate of $\mathcal{O}(\log n/\sqrt{n})$ in the number of iterations $n$ under realistic assumptions. It covers several VAE instantiations (Linear, Deep Gaussian, $\beta$-VAE, IWAE) and extends to BBVI, giving a unified convergence framework with explicit dependencies on the batch size $B$, the number of variational samples $K$, and the network depth. The analysis reveals practical tradeoffs: larger $B$ and $K$ speed convergence but increase cost, while smaller $\beta$ and deeper networks demand careful architectural and activation choices (e.g., generalized soft-clipping) to preserve smoothness. Empirical results on CelebA and CIFAR-100 corroborate the theory and offer actionable guidelines for hyperparameter selection and activation design in VAE training.

Abstract

Variational Autoencoders (VAEs) are popular generative models used to sample from complex data distributions. Despite their empirical success in various machine learning tasks, significant gaps remain in understanding their theoretical properties, particularly regarding convergence guarantees. This paper aims to bridge that gap by providing non-asymptotic convergence guarantees for VAEs trained using both Stochastic Gradient Descent and Adam algorithms. We derive a convergence rate of $\mathcal{O}(\log n / \sqrt{n})$, where $n$ is the number of iterations of the optimization algorithm, with explicit dependencies on the batch size, the number of variational samples, and other key hyperparameters. Our theoretical analysis applies to both Linear VAE and Deep Gaussian VAE, as well as several VAE variants, including $\beta$-VAE and IWAE. Additionally, we empirically illustrate the impact of hyperparameters on convergence, offering new insights into the theoretical understanding of VAE training.
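To make the roles of the batch size and the number of variational samples concrete, the following is a minimal sketch of a Monte Carlo ELBO estimate for a linear Gaussian VAE. It is not the authors' implementation: the function name `elbo_estimate`, the parameter names, and the modelling choices (standard normal prior, diagonal Gaussian encoder with input-independent variance, unit-variance Gaussian decoder) are assumptions made for illustration.

```python
import numpy as np

def elbo_estimate(x, W_enc, log_sigma_enc, W_dec, K, rng):
    """Monte Carlo ELBO estimate averaged over a mini-batch (illustrative).

    Assumed model (not taken from the paper):
      prior    p(z)   = N(0, I)
      encoder  q(z|x) = N(x @ W_enc, diag(exp(log_sigma_enc))^2)
      decoder  p(x|z) = N(z @ W_dec, I)
    x has shape (B, d) with B the batch size; K is the number of
    variational samples drawn per data point.
    """
    B, d = x.shape
    p = W_enc.shape[1]
    mu = x @ W_enc                          # encoder means, shape (B, p)
    sigma = np.exp(log_sigma_enc)           # encoder std devs, shape (p,)

    # Reparameterization (pathwise) sampling: z = mu + sigma * eps
    eps = rng.standard_normal((K, B, p))
    z = mu + sigma * eps                    # shape (K, B, p)

    # Gaussian log-likelihood log p(x|z) for each sample, shape (K, B)
    recon = z @ W_dec                       # shape (K, B, d)
    log_lik = (-0.5 * np.sum((x - recon) ** 2, axis=-1)
               - 0.5 * d * np.log(2.0 * np.pi))

    # Closed-form KL(q(z|x) || p(z)) for diagonal Gaussians, shape (B,)
    kl = 0.5 * np.sum(mu ** 2 + sigma ** 2 - 1.0 - 2.0 * log_sigma_enc, axis=-1)

    # Average over the K samples, then over the B data points
    return float(np.mean(log_lik.mean(axis=0) - kl))
```

Under these assumptions, increasing `K` reduces the variance of the reconstruction term for each data point, and increasing the batch size reduces the variance across data points. Both come at a cost proportional to `K * B` decoder evaluations per step.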

Paper Structure

This paper contains 69 sections, 30 theorems, 264 equations, 8 figures, 1 table, 1 algorithm.

Key Result

Proposition 3.1

For all $\theta, \theta' \in \Theta$ and $\phi, \phi' \in \Phi$, where $L^{\mathsf{S}}$ and $L^{\mathsf{P}}$ are defined in Lemmas lemma:ELBO_smooth_score and lemma:ELBO_smooth_pathwise, respectively.

Figures (8)

  • Figure 1: Squared norm of gradients and negative ELBO on the CelebA test set for a VAE trained with Adam and the generalized soft-clipping activation function. Bold lines represent the mean over 5 independent runs.
  • Figure 2: $\| \nabla \mathcal{L}(\theta_n, \phi_n) \|^{2}$ in $\beta$-VAE (on the left) and IWAE (on the right) trained with Adam. Bold lines represent the mean over 5 independent runs. The dashed curves correspond to the expected convergence rate $\mathcal{O}(\log n/\sqrt{n})$.
  • Figure 3: Negative ELBO in IWAE on the test sets of CelebA (on the left) and CIFAR-100 (on the right), trained with Adam. Bold lines represent the mean over 5 independent runs.
  • Figure 4: Illustration of the architecture of a VAE with a multivariate Gaussian.
  • Figure 5: $\| \nabla \mathcal{L}(\theta_n, \phi_n) \|^{2}$ in VAE trained with Adam for the baseline model, a model with an additional fully connected layer, and a model with an additional convolutional layer. Bold lines represent the mean over 5 independent runs. Figures are plotted on a logarithmic scale for better visualization.
  • ...and 3 more figures
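
The dashed reference curves mentioned in the Figure 2 caption can be reproduced along the following lines. This is a sketch under stated assumptions, not the authors' plotting code: the helper names `reference_rate` and `fit_scale` are hypothetical, and the least-squares choice of the constant in front of $\log n/\sqrt{n}$ is an illustrative convention, since the $\mathcal{O}(\cdot)$ rate leaves that constant unspecified.

```python
import numpy as np

def reference_rate(n_iters, scale=1.0):
    """Return iteration indices n = 1..n_iters and scale * log(n) / sqrt(n)."""
    n = np.arange(1, n_iters + 1)
    return n, scale * np.log(n) / np.sqrt(n)

def fit_scale(grad_sq_norms, n):
    """Least-squares constant c such that c * log(n)/sqrt(n) best matches
    a recorded curve of squared gradient norms (hypothetical helper)."""
    basis = np.log(n) / np.sqrt(n)
    return float(basis @ grad_sq_norms / (basis @ basis))
```

Given an array of squared gradient norms recorded during training, `fit_scale` gives a constant for overlaying `reference_rate` on the same axes, which makes it easier to judge by eye whether the observed decay is compatible with the stated rate.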

Theorems & Definitions (57)

  • Proposition 3.1
  • Theorem 3.2
  • Theorem 3.3
  • Corollary 3.4
  • Proposition 3.5
  • Theorem 3.6
  • Theorem 3.7
  • Corollary 3.8
  • Proposition A.1
  • proof
  • ...and 47 more