DiVAE: Photorealistic Images Synthesis with Denoising Diffusion Decoder

Jie Shi, Chenfei Wu, Jian Liang, Xiang Liu, Nan Duan

TL;DR

DiVAE proposes a VQ-VAE-like framework that uses a denoising diffusion decoder to reconstruct images from discrete embeddings, bridging VQ-VAE/VQ-GAN with diffusion-based generation. By inserting image embeddings into the diffusion UNet and training on ImageNet, it achieves state-of-the-art reconstruction quality and sharper, more photorealistic images. When coupled with an autoregressive generator for text-to-image tasks, the model produces high-fidelity samples and competitive FID on MSCOCO. The approach offers a scalable first-stage reconstruction in multi-stage image synthesis and highlights diffusion decoders as a strong alternative to GANs or autoregressive priors for high-detail image synthesis.

Abstract

Most recent successful image synthesis models are multi-stage pipelines that combine the advantages of different methods. These pipelines always include a VAE-like model that faithfully reconstructs images from embeddings and a prior model that generates image embeddings. At the same time, diffusion models have shown the capacity to generate high-quality synthetic images. Our work proposes a VQ-VAE-style model with a diffusion decoder (DiVAE) to serve as the reconstruction component in image synthesis. We explore how to feed the image embedding into the diffusion model for best performance and find that a simple modification of the diffusion UNet achieves this. Trained on ImageNet, our model achieves state-of-the-art results and, in particular, generates more photorealistic images. In addition, we pair DiVAE with an autoregressive generator on conditional synthesis tasks to produce more natural-looking and detailed samples.
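
To make the idea of a diffusion decoder conditioned on an image embedding concrete, here is a minimal sketch of one standard DDPM ancestral sampling loop in which the noise predictor also receives the embedding z. It is an illustration under stated assumptions, not the authors' implementation: the linear beta schedule, the function names (make_schedule, toy_eps_model, reverse_step, sample), and the way z is injected (nearest-neighbour upsampling inside a toy, untrained predictor) are all hypothetical stand-ins for the trained, modified UNet the abstract describes.

```python
import numpy as np


def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (a common DDPM default, assumed here)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


def toy_eps_model(x_t, t, z):
    """Hypothetical stand-in for a conditioned UNet eps_theta(x_t, t, z).

    It upsamples z to the image resolution by nearest-neighbour repetition
    and returns the residual; a real decoder would be a trained network.
    """
    factor = x_t.shape[0] // z.shape[0]
    z_up = np.repeat(np.repeat(z, factor, axis=0), factor, axis=1)
    return x_t - z_up


def reverse_step(x_t, t, z, eps_model, schedule, rng):
    """One ancestral sampling step x_t -> x_{t-1}, conditioned on z."""
    betas, alphas, alpha_bars = schedule
    eps_hat = eps_model(x_t, t, z)
    # Posterior mean of the standard DDPM reverse process.
    mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
    if t == 0:
        return mean  # no noise is added on the final step
    return mean + np.sqrt(betas[t]) * rng.standard_normal(x_t.shape)


def sample(z, image_size, eps_model, schedule, seed=0):
    """Start from Gaussian noise and denoise step by step, always passing z."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((image_size, image_size, z.shape[-1]))
    for t in reversed(range(len(schedule[0]))):
        x = reverse_step(x, t, z, eps_model, schedule, rng)
    return x
```

Calling sample(np.zeros((4, 4, 3)), 16, toy_eps_model, make_schedule(num_steps=20)) returns a 16 x 16 x 3 array. The point of the sketch is only the data flow: every denoising step sees both the current noisy image and the fixed embedding z, which is what lets the decoder reconstruct a specific image rather than an arbitrary one.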

Paper Structure (15 sections, 23 equations, 10 figures, 5 tables)

This paper contains 15 sections, 23 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: DiVAE generates more photorealistic and detailed images.
  • Figure 2: Overall framework of SOTA two-stage image synthesis models, compared with the DiVAE approach.
  • Figure 3: Structure of DiVAE, which uses a convolutional encoder and a denoising diffusion decoder; the key network is a UNet that models $p(x_{t-1} \mid x_{t}, z)$.
  • Figure 4: Comparison with VQGAN on $32^2 \rightarrow 256^2$ and $16^2 \rightarrow 256^2$ reconstruction with the same codebook dimension $Z$.
  • Figure 5: Samples of Text-to-Image (T2I) task generated by DiVAE.
  • ...and 5 more figures