WeTok: Powerful Discrete Tokenization for High-Fidelity Visual Reconstruction

Shaobin Zhuang, Yiwei Guo, Canmiao Fu, Zhipeng Huang, Zeyue Tian, Xiaohui Li, Fangyikang Wang, Ying Zhang, Chen Li, Yali Wang

TL;DR

WeTok introduces a powerful discrete visual tokenizer by developing Group-Wise Lookup-Free Quantization to scale codebooks without memory blowups and a Generative Decoder that models the distribution of images conditioned on tokens. The two-stage training couples a reconstruction-focused phase with a generative, noise-conditioned refinement, enabling high-fidelity recovery at very high compression ratios. Empirical results on ImageNet and MS-COCO demonstrate state-of-the-art reconstruction performance, including zero-shot rFID of 0.12 at 400% compression, and competitive generation quality when integrated into autoregressive frameworks. These advances show that discrete tokenizers can surpass continuous counterparts in fidelity while maintaining strong compression, with practical impact for efficient visual generation and transmission.

Abstract

The visual tokenizer is a critical component of vision generation. However, existing tokenizers often face an unsatisfactory trade-off between compression ratio and reconstruction fidelity. To fill this gap, we introduce WeTok, a powerful and concise tokenizer that surpasses previous leading tokenizers via two core innovations. (1) Group-wise lookup-free Quantization (GQ). We partition the latent features into groups and perform lookup-free quantization on each group. As a result, GQ efficiently overcomes the memory and computation limitations of prior tokenizers while achieving a reconstruction breakthrough with more scalable codebooks. (2) Generative Decoder (GD). Unlike prior tokenizers, we introduce a generative decoder with a prior over an extra noise variable. GD can thus probabilistically model the distribution of visual data conditioned on discrete tokens, allowing WeTok to reconstruct visual details, especially at high compression ratios. On the ImageNet 50k validation set, in a high-fidelity setting, WeTok achieves a record-low zero-shot rFID of 0.12, outperforming leading continuous tokenizers such as FLUX-VAE (0.18) and SD-VAE 3.5 (0.19) with a 400% compression ratio. Furthermore, in a high-compression regime, WeTok achieves a zero-shot rFID of 3.49 at a 768$\times$ compression ratio, substantially surpassing Cosmos, which scores 4.57 at only 50% of our compression ratio. Code and models are available at https://github.com/zhuangshaobin/WeTok.
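The grouping step described in (1) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the sign-based binarization commonly used for lookup-free quantization, an arbitrary little-endian bit order for the token indices, and a hypothetical function name `group_lfq`. Training-time details such as gradient estimation and the entropy loss are left out.

```python
from typing import List, Tuple


def group_lfq(latent: List[float], num_groups: int) -> Tuple[List[int], List[int]]:
    """Quantize one latent vector group by group (illustrative sketch only).

    Assumption: each channel is binarized by its sign to -1 or +1, and the
    bits of each group are packed into one integer token index.
    """
    if num_groups <= 0 or len(latent) % num_groups != 0:
        raise ValueError("latent length must be divisible by num_groups")

    bits_per_group = len(latent) // num_groups

    # Lookup-free step: no codebook search, only a per-channel sign test.
    codes = [1 if value > 0 else -1 for value in latent]

    indices = []
    for g in range(num_groups):
        group = codes[g * bits_per_group:(g + 1) * bits_per_group]
        # Pack the group's bits into an integer in [0, 2**bits_per_group).
        index = sum(1 << k for k, code in enumerate(group) if code > 0)
        indices.append(index)

    return codes, indices


# A 4-channel latent split into 2 groups of 2 channels each.
codes, indices = group_lfq([0.3, -1.2, 0.7, 0.1], num_groups=2)
print(codes)    # [1, -1, 1, 1]
print(indices)  # [1, 3]
```

In this sketch each group indexes only $2^{D/G}$ codes for a latent of $D$ channels split into $G$ groups, while the combination of all group indices still covers $2^{D}$ distinct codes, so the full codebook never has to be enumerated.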

Paper Structure

This paper contains 24 sections, 5 theorems, 23 equations, 15 figures, 39 tables.

Key Result

Proposition 3.1

For any choice of the group number $G$, the codebook entropy approximation error (as in Eq. eq:gfq_approximate) of our GQ method is smaller than that of the BSQ method.

Figures (15)

  • Figure 1: Zero-shot reconstruction comparison with state-of-the-art tokenizers. (a) Our WeTok establishes a new state-of-the-art trade-off between compression and reconstruction performance among the compared methods. (b) WeTok achieves a significant improvement in reconstruction quality over previous discrete tokenizers such as VQVAE and Open-MAGVIT2.
  • Figure 2: WeTok with Group-Wise Lookup-Free Quantization and Generative Decoder.
  • Figure 3: Quantization method ablation. GQ and LFQ are significantly better than BSQ.
  • Figure 4: Number-of-groups ablation. $G$ refers to the number of groups. Reconstruction performance improves significantly as $G$ increases.
  • Figure 5: Model architecture ablation. $C$ and $B$ refer to the number of base channels and residual blocks, respectively. $C=256$ and $B=4$ achieve the best reconstruction performance.
  • ...and 10 more figures

Theorems & Definitions (15)

  • Proposition 3.1
  • Remark 3.2
  • Definition A.1
  • Remark A.2
  • Definition A.3
  • Remark A.4
  • Lemma A.5
  • proof
  • Lemma A.6
  • proof
  • ...and 5 more