
XQ-GAN: An Open-source Image Tokenization Framework for Autoregressive Generation

Xiang Li, Kai Qiu, Hao Chen, Jason Kuen, Jiuxiang Gu, Jindong Wang, Zhe Lin, Bhiksha Raj

TL;DR

XQ-GAN tackles the challenge of flexible, high-quality image tokenization for both reconstruction and autoregressive generation. It introduces a modular, hierarchical quantization framework that combines VQ, RQ, MSVQ, PQ, LFQ, BSQ, and MSRQ, coupled with semantic alignment via DINOv2 or CLIP and adversarial guidance. The two pipelines, XQ-GAN-SC and XQ-GAN-V, enable spatially compressed and vanilla tokenizations, achieving state-of-the-art reconstruction metrics on ImageNet 256×256 and strong generation performance with scalable codebooks and alignment strategies. The work provides open-source weights and extensive experiments across ImageNet, LAION-400M, and IMed-361M, underscoring practical impact for community replication and downstream generative modeling.

Abstract

Image tokenizers play a critical role in shaping the performance of subsequent generative models. Since the introduction of VQ-GAN, discrete image tokenization has undergone remarkable advancements. Improvements in architecture, quantization techniques, and training recipes have significantly enhanced both image reconstruction and downstream generation quality. In this paper, we present XQ-GAN, an image tokenization framework designed for both image reconstruction and generation tasks. Our framework integrates state-of-the-art quantization techniques, including vector quantization (VQ), residual quantization (RQ), multi-scale residual quantization (MSVQ), product quantization (PQ), lookup-free quantization (LFQ), and binary spherical quantization (BSQ), within a highly flexible and customizable training environment. On the standard ImageNet 256×256 benchmark, our released model achieves an rFID of 0.64, significantly surpassing MAGVIT-v2 (0.9 rFID) and VAR (0.9 rFID). Furthermore, we demonstrate that using XQ-GAN as a tokenizer improves gFID alongside rFID. For instance, with the same VAR architecture, XQ-GAN+VAR achieves a gFID of 2.6, outperforming VAR's 3.3 gFID by a notable margin. To support further research, we provide pre-trained weights for several image tokenizers, so that the community can train subsequent generative models on them directly or fine-tune them for specialized tasks.

Paper Structure

This paper contains 33 sections, 6 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Performance comparison of XQ-GAN and prior art on the ImageNet 256×256 reconstruction benchmark. We provide XQ-GAN tokenizers for both vanilla AR and VAR (Tian et al., 2024) modeling. XQ-GAN outperforms mainstream tokenizers on both AR and VAR tasks. XQ-GAN variants are named XQ-{multi-scale (MS)}-{vector quantization (V), lookup-free quantization (L), binary spherical quantization (B)}-{residual quantization (R)}$^{N}$-{product quantization (P)}$^{P}$, where $N$ and $P$ denote the residual depth and the number of product branches, respectively.
  • Figure 2: Overview of the XQ-GAN-SC pipeline with latent spatial compression (SC). A vision transformer serves as both the encoder and the decoder. $P$ sets of $K\times K$ learnable tokens act as queries over the image tokens. $P$ denotes the number of quantizers: $P=1$ is the vanilla Vector Quantization (Esser et al., 2021) setting and $P>1$ is a Product Quantization (Li et al., 2024) setting. $K\times K$ denotes the spatial resolution of the quantized latent. During decoding, $L\times L$ learnable tokens query the quantized tokens and are then decoded into the reconstructed image. The cost of image decoding is independent of the number of quantizers, making the pipeline suitable for generation tasks that require only decoding at inference time.
  • Figure 3: XQ-GAN-V pipeline.
  • Figure 4: Visualization of Vector Quantization (VQ), Residual Quantization (RQ), Product Quantization (PQ), Lookup-Free Quantization (LFQ), and Binary Spherical Quantization (BSQ) in a simple two-dimensional space. VQ, equivalent to k-means clustering, partitions the space into Voronoi regions based on the nearest centroids. RQ refines this partition iteratively by quantizing the residual at each step. PQ quantizes the space with a combination of several codewords on subspaces. LFQ projects the tokens into several binary subspaces and quantizes each subspace to $\{-1, 1\}$. BSQ applies an $L_2$ normalization to LFQ's subspace prior to quantization, resulting in a spherical quantization boundary.
  • Figure 5: Hierarchical quantizer design. With $P=1$, $N=1$, and VQ as the residual quantizer, the design reduces to vanilla VQ (Esser et al., 2021).
  • ...and 2 more figures
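
The quantizers described in the Figure 4 caption can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the XQ-GAN implementation: the function names (`vq`, `rq`, `pq`, `lfq`, `bsq`), the random codebooks, and the toy shapes are all hypothetical, and training-time details such as straight-through gradients and commitment losses are omitted. The `bsq` output is scaled by $1/\sqrt{d}$ so that quantized vectors stay on the unit sphere, which is a common convention and is assumed here rather than taken from the source.

```python
import numpy as np

def vq(x, codebook):
    """Vector quantization: map each row of x to its nearest codeword."""
    # Squared Euclidean distance from every input to every codeword.
    d = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)
    return idx, codebook[idx]

def rq(x, codebooks):
    """Residual quantization: quantize, subtract, and repeat per depth."""
    residual = x.copy()
    recon = np.zeros_like(x)
    indices = []
    for cb in codebooks:
        idx, q = vq(residual, cb)
        recon += q          # accumulate the approximation
        residual -= q       # next depth quantizes what is left over
        indices.append(idx)
    return np.stack(indices, axis=1), recon

def pq(x, codebooks):
    """Product quantization: one independent VQ per channel subspace."""
    chunks = np.split(x, len(codebooks), axis=1)
    out = [vq(c, cb) for c, cb in zip(chunks, codebooks)]
    indices = np.stack([i for i, _ in out], axis=1)
    recon = np.concatenate([q for _, q in out], axis=1)
    return indices, recon

def lfq(x):
    """Lookup-free quantization: each dimension becomes -1 or +1 by sign."""
    return np.where(x > 0, 1.0, -1.0)

def bsq(x):
    """Binary spherical quantization: L2-normalize, take signs, rescale to unit norm."""
    u = x / np.linalg.norm(x, axis=1, keepdims=True)  # assumes non-zero rows
    return lfq(u) / np.sqrt(x.shape[1])

# Toy data: 8 tokens with 4 channels each (hypothetical sizes).
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))

vq_idx, vq_out = vq(x, rng.normal(size=(16, 4)))
rq_books = [rng.normal(size=(16, 4)) for _ in range(3)]  # residual depth N = 3
rq_idx, rq_out = rq(x, rq_books)
pq_books = [rng.normal(size=(16, 2)) for _ in range(2)]  # P = 2 product branches
pq_idx, pq_out = pq(x, pq_books)
lfq_out, bsq_out = lfq(x), bsq(x)
```

In this sketch, RQ emits $N$ indices per token and PQ emits $P$ indices per token, which mirrors the $N$ (residual depth) and $P$ (product branch) parameters in the variant naming scheme of Figure 1. LFQ and BSQ need no codebook lookup because the code is read directly from the signs of the latent.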