XQ-GAN: An Open-source Image Tokenization Framework for Autoregressive Generation

Xiang Li; Kai Qiu; Hao Chen; Jason Kuen; Jiuxiang Gu; Jindong Wang; Zhe Lin; Bhiksha Raj

TL;DR

XQ-GAN tackles the challenge of flexible, high-quality image tokenization for both reconstruction and autoregressive generation. It introduces a modular, hierarchical quantization framework that combines VQ, RQ, MSVQ, PQ, LFQ, BSQ, and MSRQ, coupled with semantic alignment via DINOv2 or CLIP and adversarial guidance. The two pipelines, XQ-GAN-SC and XQ-GAN-V, enable spatially compressed and vanilla tokenizations, achieving state-of-the-art reconstruction metrics on ImageNet 256×256 and strong generation performance with scalable codebooks and alignment strategies. The work provides open-source weights and extensive experiments across ImageNet, LAION-400M, and IMed-361M, underscoring practical impact for community replication and downstream generative modeling.

Abstract

Image tokenizers play a critical role in shaping the performance of subsequent generative models. Since the introduction of VQ-GAN, discrete image tokenization has undergone remarkable advancements. Improvements in architecture, quantization techniques, and training recipes have significantly enhanced both image reconstruction and downstream generation quality. In this paper, we present XQ-GAN, an image tokenization framework designed for both image reconstruction and generation tasks. Our framework integrates state-of-the-art quantization techniques, including vector quantization (VQ), residual quantization (RQ), multi-scale residual quantization (MSVQ), product quantization (PQ), lookup-free quantization (LFQ), and binary spherical quantization (BSQ), within a highly flexible and customizable training environment. On the standard ImageNet 256×256 benchmark, our released model achieves an rFID of 0.64, significantly surpassing MAGVIT-v2 (0.9 rFID) and VAR (0.9 rFID). Furthermore, we demonstrate that using XQ-GAN as a tokenizer improves gFID as well as rFID. For instance, with the same VAR architecture, XQ-GAN+VAR achieves a gFID of 2.6, outperforming VAR's 3.3 gFID by a notable margin. To support further research, we provide pre-trained weights for several image tokenizers, so that the community can train subsequent generative models directly on them or fine-tune them for specialized tasks.
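As a rough illustration of two of the quantizers named in the abstract, the sketch below shows plain vector quantization (nearest codebook entry) and residual quantization (repeatedly quantizing what the previous stage left over). It is a minimal toy in pure Python, not the XQ-GAN implementation: the function names `vector_quantize` and `residual_quantize`, the tiny hand-written codebooks, and the squared-L2 distance are all assumptions made for this example.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def vector_quantize(z: Vector, codebook: Sequence[Vector]) -> Tuple[int, List[float]]:
    """VQ: return the index and value of the codebook entry nearest to z."""
    def sq_dist(a: Vector, b: Vector) -> float:
        # Squared Euclidean distance (an assumed choice of metric).
        return sum((x - y) ** 2 for x, y in zip(a, b))

    idx = min(range(len(codebook)), key=lambda i: sq_dist(z, codebook[i]))
    return idx, list(codebook[idx])


def residual_quantize(
    z: Vector, codebooks: Sequence[Sequence[Vector]]
) -> Tuple[List[int], List[float]]:
    """RQ: quantize z in stages, each stage coding the remaining residual."""
    residual = list(z)
    recon = [0.0] * len(z)
    indices: List[int] = []
    for codebook in codebooks:
        idx, code = vector_quantize(residual, codebook)
        indices.append(idx)
        # Accumulate the reconstruction and shrink the residual.
        recon = [r + c for r, c in zip(recon, code)]
        residual = [r - c for r, c in zip(residual, code)]
    return indices, recon


# Hypothetical toy codebooks for a 2-dimensional latent vector.
coarse = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
fine = [[0.0, 0.0], [-0.1, 0.2], [0.1, -0.2]]

z = [0.9, 1.2]
vq_index, vq_code = vector_quantize(z, coarse)
rq_indices, rq_recon = residual_quantize(z, [coarse, fine])
```

In this toy setup a single VQ stage maps `z` to one token, while the two-stage RQ emits one token per stage and reconstructs `z` more closely, which is the general trade-off between token count and reconstruction fidelity that such quantizers navigate.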
