DGAE: Diffusion-Guided Autoencoder for Efficient Latent Representation Learning

Dongxu Liu, Jiahui Zhu, Yuang Peng, Haomiao Tang, Yuwei Chen, Chunrui Han, Zheng Ge, Daxin Jiang, Mingxue Liao

TL;DR

DGAE addresses the trade-off between spatial compression and reconstruction fidelity by using a diffusion model as the decoder, which makes the latent space more compact without sacrificing detail. The approach replaces the conventional Gaussian decoder with a conditional diffusion decoder guided by the latent $z$, and optimizes a score-based objective alongside KL and perceptual losses. Empirical results show that DGAE preserves high-frequency textures at smaller latent sizes, improves as the decoder is scaled up, and yields latent representations that enable faster convergence for latent diffusion models on ImageNet-1K. Diffusion-guided decoding thus offers a stable, efficient path to high-quality reconstruction and diffusion-based generation with reduced latent dimensionality. The work highlights the decoder’s central role in autoencoders and demonstrates practical benefits for downstream diffusion training.

Abstract

Autoencoders empower state-of-the-art image and video generative models by compressing pixels into a latent space through visual tokenization. Although recent advances have alleviated the performance degradation of autoencoders under high compression ratios, the training instability caused by GAN-based training remains an open challenge. While improving spatial compression, we also aim to minimize the latent space dimensionality, enabling more efficient and compact representations. To tackle these challenges, we focus on improving the decoder's expressiveness. Concretely, we propose DGAE, which employs a diffusion model to guide the decoder in recovering informative signals that are not fully decoded from the latent representation. With this design, DGAE effectively mitigates the performance degradation under high spatial compression rates. At the same time, DGAE achieves state-of-the-art performance with a 2x smaller latent space. When integrated with diffusion models, DGAE demonstrates competitive performance on image generation for ImageNet-1K and shows that this compact latent representation facilitates faster convergence of the diffusion model.
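
The TL;DR describes the training objective as a score-based term combined with KL and perceptual losses. As a rough illustration only, a generic objective of that shape can be written as follows. This is a sketch built on standard denoising score matching, not the paper's exact formulation: the noise schedule $(\alpha_t, \sigma_t)$, the weighting $\lambda(t)$, the coefficients $\beta$ and $\gamma$, the encoder $q_\phi$, the score network $s_\theta$, and the reconstruction $\hat{x}_0$ are all assumed notation introduced here.

$$
\begin{aligned}
x_t &= \alpha_t x_0 + \sigma_t \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I), \qquad z \sim q_\phi(z \mid x_0) \\
\mathcal{L}_{\text{score}} &= \mathbb{E}_{x_0,\, t,\, \epsilon,\, z}\left[ \lambda(t)\, \left\| s_\theta(x_t, t, z) + \frac{\epsilon}{\sigma_t} \right\|_2^2 \right] \\
\mathcal{L}_{\text{KL}} &= D_{\mathrm{KL}}\left( q_\phi(z \mid x_0) \,\|\, \mathcal{N}(0, I) \right) \\
\mathcal{L} &= \mathcal{L}_{\text{score}} + \beta\, \mathcal{L}_{\text{KL}} + \gamma\, \mathcal{L}_{\text{perc}}(x_0, \hat{x}_0)
\end{aligned}
$$

Here $-\epsilon / \sigma_t$ is the score of the Gaussian perturbation kernel $p(x_t \mid x_0)$, so the first term trains the decoder to denoise $x_t$ while conditioned on $z$. The KL term regularizes the latent distribution as in a standard VAE.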

Paper Structure

This paper contains 15 sections, 10 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: (a) Scaling up the discriminator in GANs can mitigate the decline in reconstruction accuracy of autoencoders under high spatial compression rates, while also enhancing reconstruction performance at low spatial compression rates. (b) Scaling up the decoder effectively improves the reconstruction quality of the autoencoder, while scaling up the encoder has little effect.
  • Figure 2: DGAE is a diffusion-guided autoencoder designed to strengthen the decoder. Unlike in GAN-guided methods, the latent representation $z$ is no longer used for direct image reconstruction. Instead, it serves as a supervisory signal for the decoder, thereby better constraining $p(x|z)$ to the data distribution $p(x)$.
  • Figure 3: Reconstructed samples of DGAE and SD-VAE. These results suggest that, despite employing a simpler combination of losses, DGAE benefits from the strong modeling capacity of the diffusion decoder, leading to more effective recovery of fine-grained details such as textures and structural patterns.
  • Figure 4: Reconstruction samples with different latent sizes. Results were obtained at a fixed spatial compression rate of f16 while gradually decreasing the channel dimension of the latent representation. As the latent size decreases, SD-VAE tends to collapse, while DGAE still maintains high fidelity.
  • Figure 5: Scalability Evaluation of DGAE. By scaling up the decoder, DGAE achieves better reconstruction quality with enhanced detail preservation.
  • ...and 3 more figures
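
To make the latent-size terminology in the Figure 4 caption concrete, the sketch below computes the latent shape and the overall compression ratio for a given spatial downsampling factor (f16 means each spatial side shrinks by 16) and channel count. The function names, the 256x256 input resolution, and the channel counts 16 and 8 are illustrative assumptions, not values taken from the paper.

```python
def latent_shape(height, width, downsample, channels):
    """Return (h, w, c) of the latent for an image of size height x width.

    `downsample` is the spatial compression factor, e.g. 16 for f16.
    """
    if height % downsample or width % downsample:
        raise ValueError("image size must be divisible by the downsample factor")
    return (height // downsample, width // downsample, channels)


def compression_ratio(height, width, downsample, channels, image_channels=3):
    """Ratio of pixel values in the image to values in the latent."""
    h, w, c = latent_shape(height, width, downsample, channels)
    return (height * width * image_channels) / (h * w * c)


# Hypothetical example: a 256x256 RGB image at f16 with two channel counts.
shape_c16 = latent_shape(256, 256, 16, 16)
shape_c8 = latent_shape(256, 256, 16, 8)
ratio_c16 = compression_ratio(256, 256, 16, 16)
ratio_c8 = compression_ratio(256, 256, 16, 8)
print(shape_c16, ratio_c16)
print(shape_c8, ratio_c8)
```

At a fixed spatial factor, halving the channel count halves the latent size and doubles the compression ratio, which is the sense in which a latent space can be "2x smaller" without changing the spatial compression rate.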