MGVQ: Could VQ-VAE Beat VAE? A Generalizable Tokenizer with Multi-group Quantization

Mingkai Jia, Wei Yin, Xiaotao Hu, Jiaxin Guo, Xiaoyang Guo, Qian Zhang, Xiao-Xiao Long, Ping Tan

TL;DR

MGVQ tackles the reconstruction gap between VQ-VAE and VAEs by preserving the latent dimension and expanding discrete latent capacity through multi-group quantization with sub-codebooks. A nested masking training strategy enforces ordered, coarse-to-fine encoding, enabling massive increases in representation capacity without severe codebook collapse. The approach achieves state-of-the-art reconstruction on ImageNet 256p and 2K HD zero-shot benchmarks, outperforming both discrete tokenizers and continuous baselines like SD-VAE in PSNR and rFID. These results demonstrate the potential of high-fidelity, scalable discrete latent representations for HD image processing and broad generalization.

Abstract

Vector Quantized Variational Autoencoders (VQ-VAEs) are fundamental models that compress continuous visual data into discrete tokens. Existing methods have tried to improve the quantization strategy for better reconstruction quality; however, a large gap remains between VQ-VAEs and VAEs. To narrow this gap, we propose MGVQ, a novel method that augments the representation capability of discrete codebooks, making the codebooks easier to optimize and minimizing information loss, thereby enhancing reconstruction quality. Specifically, we propose to retain the latent dimension to preserve encoded features and to incorporate a set of sub-codebooks for quantization. Furthermore, we construct comprehensive zero-shot benchmarks featuring resolutions of 512p and 2k to rigorously evaluate the reconstruction performance of existing methods. MGVQ achieves state-of-the-art performance among all VQ-VAEs on both ImageNet and 8 zero-shot benchmarks. Notably, MGVQ significantly outperforms SD-VAE on ImageNet, with an rFID of 0.49 vs. 0.91, and achieves superior PSNR on all zero-shot benchmarks. These results highlight the superiority of MGVQ in reconstruction and pave the way for preserving fidelity in HD image processing tasks. Code will be publicly available at https://github.com/MKJia/MGVQ.
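The core idea in the abstract, keeping the full latent dimension and quantizing it with a set of sub-codebooks, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`nearest_code`, `multi_group_quantize`, `capacity_bits`), the equal-width split, the squared-L2 nearest-neighbor lookup, and the toy codebook values are all assumptions made for illustration.

```python
import math

def nearest_code(vec, codebook):
    """Return the index of the codebook entry closest to vec (squared L2)."""
    def sq_dist(code):
        return sum((v - c) ** 2 for v, c in zip(vec, code))
    return min(range(len(codebook)), key=lambda k: sq_dist(codebook[k]))

def multi_group_quantize(z, sub_codebooks):
    """Split z into equal-width sub-tokens, one per sub-codebook, and
    quantize each sub-token against its own sub-codebook.
    Returns (indices, z_q), where z_q concatenates the chosen entries."""
    groups = len(sub_codebooks)
    if len(z) % groups != 0:
        raise ValueError("latent dimension must be divisible by the group count")
    width = len(z) // groups
    indices, z_q = [], []
    for g, codebook in enumerate(sub_codebooks):
        sub_token = z[g * width:(g + 1) * width]
        k = nearest_code(sub_token, codebook)
        indices.append(k)
        z_q.extend(codebook[k])
    return indices, z_q

def capacity_bits(sub_codebooks):
    """log2 of the number of distinct quantized tokens, i.e. the sum of
    log2(sub-codebook size) over all groups."""
    return sum(math.log2(len(cb)) for cb in sub_codebooks)

# Toy setup (hypothetical values): 2 groups, 2 entries each, 4-dim latent.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
z = [0.9, 0.8, 0.1, 0.7]
indices, z_q = multi_group_quantize(z, codebooks)
print(indices, z_q, capacity_bits(codebooks))
```

Because each group picks its code independently, the number of representable tokens is the product of the sub-codebook sizes. Under the assumption of equally sized sub-codebooks, 8 groups of $2^{11}$ entries each would give $2^{88}$ combinations, which matches the capacity reported for MGVQ-G8.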

Paper Structure

This paper contains 18 sections, 5 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: Reconstruction performance of VQ-VAEs and SD-VAE on the ImageNet 256×256 benchmark with $16\times$ downsampling. Bubble size indicates capacity, that is, the number of distinct tokens that can be sampled from the codebook. MGVQ-G8, with $8$ groups, achieves a PSNR of $24.70$, clearly surpassing all others, with a large capacity of $2^{88}$. Qualitative results are shown with details zoomed in for a better view.
  • Figure 2: An overview of the MGVQ framework. MGVQ keeps a larger dimension $C_l$ for the latent $\mathbf{z}$ and splits it into $G$ sub-tokens, each quantized individually with its own sub-codebook $\mathcal{E}_i$. The quantized sub-tokens are then combined to form $\mathbf{z}_q$ for decoding.
  • Figure 3: (i) Codebook points of well-trained VQ-VAE models with different codebook sizes or latent dimensions. (ii) Sub-codebook points in our proposed MGVQ. The group size is 4. Used points are shown in red, and dead points in blue. (i.a) A larger dimension and smaller size may lead to an anisotropic distribution and low usage. (i.b) A larger dimension and larger size favor certain directions, resulting in dead points in a specific area (lower right corner). (i.c) A smaller dimension and smaller size can be fully used without dead points. (i.d) A smaller dimension and larger size allow for a uniform spread, but there are more codes than needed, leading to some underutilization.
  • Figure 4: The process of nested masking. Gray blocks represent the masked last tokens, and other colors show active sub-groups.
  • Figure 5: Qualitative reconstruction results with $16\times$ downsampling on the 2560 $\times$ 1440 UHDBench dataset. We crop a $360\times 360$ sub-region and zoom in on detailed textures, using blue and yellow for a better view.
  • ...and 3 more figures
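
The nested masking described in the Figure 4 caption, where the last sub-tokens are masked and the leading sub-groups stay active, might be sketched as follows. This is a hedged illustration only: the helper names (`nested_mask`, `sample_keep`), the zero fill value, and the uniform choice of how many groups to keep are assumptions, not details confirmed by the source.

```python
import random

def nested_mask(sub_tokens, keep, fill=0.0):
    """Keep the first `keep` sub-tokens and replace every later one with
    a constant vector of `fill` (the fill value is an assumption)."""
    if not 1 <= keep <= len(sub_tokens):
        raise ValueError("keep must be between 1 and the number of groups")
    return [
        list(tok) if g < keep else [fill] * len(tok)
        for g, tok in enumerate(sub_tokens)
    ]

def sample_keep(num_groups, rng):
    """Draw how many leading groups stay active for one training sample.
    Uniform sampling is an assumption made for this sketch."""
    return rng.randint(1, num_groups)

# Toy example (hypothetical values): 4 sub-tokens of width 2.
sub_tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
masked = nested_mask(sub_tokens, keep=2)
print(masked)

rng = random.Random(0)
keep = sample_keep(len(sub_tokens), rng)
print(keep, nested_mask(sub_tokens, keep))
```

Since a later group is only ever active when all earlier groups are also active, earlier groups are pushed to carry the coarse information and later groups the refinements, which is one way to read the ordered, coarse-to-fine encoding mentioned in the TL;DR.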