LiteVAE: Lightweight and Efficient Variational Autoencoders for Latent Diffusion Models

Seyedmorteza Sadat; Jakob Buhmann; Derek Bradley; Otmar Hilliges; Romann M. Weber

TL;DR

This paper introduces LiteVAE, a new autoencoder design for LDMs that leverages the 2D discrete wavelet transform to improve scalability and computational efficiency over standard variational autoencoders (VAEs) without sacrificing output quality.

Abstract

Advances in latent diffusion models (LDMs) have revolutionized high-resolution image generation, but the design space of the autoencoder that is central to these systems remains underexplored. In this paper, we introduce LiteVAE, a new autoencoder design for LDMs, which leverages the 2D discrete wavelet transform to enhance scalability and computational efficiency over standard variational autoencoders (VAEs) with no sacrifice in output quality. We investigate the training methodologies and the decoder architecture of LiteVAE and propose several enhancements that improve the training dynamics and reconstruction quality. Our base LiteVAE model matches the quality of the established VAEs in current LDMs with a six-fold reduction in encoder parameters, leading to faster training and lower GPU memory requirements, while our larger model outperforms VAEs of comparable complexity across all evaluated metrics (rFID, LPIPS, PSNR, and SSIM).

Paper Structure (50 sections, 4 equations, 9 figures, 24 tables)


Introduction
Related work
Diffusion models and LDMs
Wavelet transform
Background
Deep autoencoders
Discrete wavelet transform
Method
Model design
Self-modulated convolution
Training improvements
Training resolution
Improving the adversarial setup
Additional loss functions
Experiments
...and 35 more sections

Figures (9)

Figure 1: An overview of LiteVAE. The input image is first decomposed into multi-level wavelet coefficients, and each wavelet sub-band is separately processed via a feature-extraction network. The features are then combined via a feature-aggregation module to compute the final latent code, which is then transformed back into the image space by the decoder. We use a lightweight UNet architecture (top right) without spatial down/upsampling for feature extraction and aggregation. The decoder is a fully convolutional network similar to that in the Stable Diffusion VAE (Rombach et al., 2022). LiteVAE's design allows it to be significantly more efficient than standard VAEs in LDMs while maintaining high reconstruction quality.
Figure 2: RGB visualization of the first three channels of an SD-VAE latent code.
Figure 3: Two examples of the feature maps from the final block of the decoder before and after removing group normalization layers. Using SMC blocks instead of group normalization allows the model to learn more balanced feature maps. The image is best viewed when zoomed in.
Figure 4: An example of the autoencoder reconstruction alongside the learned latent code by LiteVAE. We observe that LiteVAE maintains the image-like structure of SD-VAE.
Figure 5: Comparing the performance of LiteVAE with a normal VAE across different resolutions. LiteVAE shows less degradation in all metrics.
...and 4 more figures
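
The multi-level wavelet decomposition described in the Figure 1 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the Haar wavelet, the number of levels (3), and the function names `haar_dwt2` and `multilevel_haar` are assumptions made for illustration only, and the feature-extraction and aggregation networks are omitted.

```python
import numpy as np

def haar_dwt2(x):
    """One level of an orthonormal 2D Haar DWT (assumed wavelet, for illustration).

    x: array of shape (H, W) with even H and W.
    Returns four sub-bands, each of shape (H/2, W/2): the low-pass
    approximation and three detail bands.
    """
    a = x[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    approx = (a + b + c + d) / 2.0      # low-pass in both directions
    detail_rows = (a + b - c - d) / 2.0  # difference between the two rows
    detail_cols = (a - b + c - d) / 2.0  # difference between the two columns
    detail_diag = (a - b - c + d) / 2.0  # diagonal difference
    return approx, detail_rows, detail_cols, detail_diag

def multilevel_haar(x, levels=3):
    """Apply haar_dwt2 repeatedly to the approximation band.

    Returns a list with one (approx, detail_rows, detail_cols, detail_diag)
    tuple per level; spatial size halves at every level.
    """
    coeffs = []
    approx = x
    for _ in range(levels):
        approx, d_rows, d_cols, d_diag = haar_dwt2(approx)
        coeffs.append((approx, d_rows, d_cols, d_diag))
    return coeffs

if __name__ == "__main__":
    image = np.random.default_rng(0).random((64, 64))  # stand-in for one image channel
    for level, bands in enumerate(multilevel_haar(image, levels=3), start=1):
        print(level, [band.shape for band in bands])
```

In a pipeline like the one in Figure 1, the sub-bands at each level would be passed to a feature-extraction network; here they are only computed and their shapes printed. Naming conventions for the detail bands (LH, HL, HH) vary between libraries, so descriptive names are used instead.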
