MoVQ: Modulating Quantized Vectors for High-Fidelity Image Generation

Chuanxia Zheng, Long Tung Vuong, Jianfei Cai, Dinh Phung

TL;DR

MoVQ tackles the repeated-pattern artifacts of two-stage VQ-based image generation by adding spatially conditional normalization that modulates the quantized vectors, paired with multichannel quantization that expands representational capacity without enlarging the codebook. In the second stage, a MaskGIT prior over the multichannel tokens (with autoregressive alternatives) enables efficient and diverse generation. Results on FFHQ and ImageNet show improved reconstruction quality and competitive or superior generation fidelity with fewer parameters than several baselines. The approach remains simple and computationally efficient, highlighting the potential of better quantizers and spatial modulation in discrete latent space for high-fidelity image synthesis.

Abstract

Although two-stage Vector Quantized (VQ) generative models allow for synthesizing high-fidelity and high-resolution images, their quantization operator encodes similar patches within an image into the same index, resulting in repeated artifacts in similar adjacent regions when existing decoder architectures are used. To address this issue, we propose to incorporate spatially conditional normalization to modulate the quantized vectors so as to insert spatially variant information into the embedded index maps, encouraging the decoder to generate more photorealistic images. Moreover, we use multichannel quantization to increase the recombination capability of the discrete codes without increasing the cost of the model and codebook. Additionally, to generate discrete tokens at the second stage, we adopt a Masked Generative Image Transformer (MaskGIT) to learn an underlying prior distribution in the compressed latent space, which is much faster than the conventional autoregressive model. Experiments on two benchmark datasets demonstrate that our proposed modulated VQGAN is able to greatly improve the reconstructed image quality as well as provide high-fidelity image generation.
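
The multichannel quantization mentioned in the abstract can be illustrated with a minimal NumPy sketch. It assumes one plausible reading: each encoder feature vector is split into chunks along the channel dimension, and every chunk is matched to its nearest entry in a single shared codebook. The function name `multichannel_quantize`, the tensor sizes, and the random inputs are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import numpy as np

def multichannel_quantize(z, codebook, num_chunks):
    """Quantize each channel chunk of z against one shared codebook.

    z:        (H, W, C) continuous encoder features
    codebook: (K, C // num_chunks) shared code vectors
    Returns indices of shape (H, W, num_chunks) and the
    quantized features of shape (H, W, C).
    """
    H, W, C = z.shape
    d = C // num_chunks
    assert d * num_chunks == C, "C must be divisible by num_chunks"
    assert codebook.shape[1] == d, "codebook width must equal the chunk width"

    # Split the channel axis into num_chunks pieces of width d.
    chunks = z.reshape(H, W, num_chunks, d)
    # Squared Euclidean distance from every chunk to every code: (H, W, num_chunks, K).
    dist = ((chunks[..., None, :] - codebook) ** 2).sum(axis=-1)
    # Nearest code per chunk.
    indices = dist.argmin(axis=-1)
    # Look the codes up and merge the chunks back into C channels.
    z_q = codebook[indices].reshape(H, W, C)
    return indices, z_q

# Illustrative sizes only: a 16x16 grid with 4 index channels per position.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(256, 8))
z = rng.normal(size=(16, 16, 32))
indices, z_q = multichannel_quantize(z, codebook, num_chunks=4)
print(indices.shape, z_q.shape)  # (16, 16, 4) (16, 16, 32)
```

Under this reading, each spatial position is described by several indices instead of one, so the number of representable combinations per position grows from K to K raised to the number of chunks while the codebook itself stays the same size.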

Paper Structure

This paper contains 31 sections, 6 equations, 10 figures, and 3 tables.

Figures (10)

  • Figure 1: $256\times256$ image samples generated by the proposed MoVQ, with the model trained on FFHQ.
  • Figure 2: Left: The quantizer architecture of our proposed MoVQ. We incorporate the spatially conditional normalization layer into the decoder, where the two convolution layers predict modulation parameters $\gamma$ and $\beta$ in a point-wise way to modulate the learned discrete structure information. Right: Masked image generation. Here, a bidirectional transformer is applied to estimate the underlying prior distribution on the discrete representation with multiple channels.
  • Figure 3: Reconstructions from different models. The numbers denote the latent size and the learned codebook size, respectively. Compared to the latest state-of-the-art RQ-VAE (Lee et al., 2022), our model dramatically improves the image quality in the first stage under the same compression ratio.
  • Figure 4: Top: original $256\times256\times3$ images, bottom: reconstructed images from our MoVQ with a $16\times16\times4$ latent representation in a discrete space. Zoom in to see the details.
  • Figure 5: $256\times256$ images generated by our MoVQ.
  • ...and 5 more figures
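
The modulation described in the Figure 2 caption, where two convolution layers predict $\gamma$ and $\beta$ point-wise from the quantized vectors, can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the two convolutions are taken to be 1×1 (so each is a per-position matrix multiply), the decoder activation is normalized per channel over the spatial positions, and the quantized map is assumed to have the same spatial resolution as the activation. The function name `spatially_conditional_norm` and all weights are hypothetical; the paper's exact normalization and layer shapes may differ.

```python
import numpy as np

def spatially_conditional_norm(f, z_q, w_gamma, b_gamma, w_beta, b_beta, eps=1e-5):
    """Modulate a normalized activation with per-position scale and shift.

    f:       (H, W, C) decoder activation
    z_q:     (H, W, D) quantized vectors at the same resolution (assumption)
    w_gamma, w_beta: (D, C) weights of two hypothetical 1x1 convolutions
    b_gamma, b_beta: (C,) biases of those convolutions
    """
    # Normalize each channel over the spatial positions (assumed normalization).
    mu = f.mean(axis=(0, 1), keepdims=True)
    sigma = f.std(axis=(0, 1), keepdims=True)
    f_norm = (f - mu) / (sigma + eps)

    # A 1x1 convolution is a matrix multiply applied independently at each position.
    gamma = z_q @ w_gamma + b_gamma  # (H, W, C) per-position scale
    beta = z_q @ w_beta + b_beta     # (H, W, C) per-position shift
    return gamma * f_norm + beta

# Illustrative sizes and random weights only.
rng = np.random.default_rng(0)
H, W, C, D = 16, 16, 8, 4
f = rng.normal(size=(H, W, C))
z_q = rng.normal(size=(H, W, D))
w_gamma = 0.1 * rng.normal(size=(D, C))
w_beta = 0.1 * rng.normal(size=(D, C))
out = spatially_conditional_norm(f, z_q, w_gamma, np.ones(C), w_beta, np.zeros(C))
print(out.shape)  # (16, 16, 8)
```

With zero weights, unit scale bias, and zero shift bias, the function reduces to plain normalization, which makes it easy to check that the modulation path is the only place where the quantized vectors enter.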