VQRAE: Representation Quantization Autoencoders for Multimodal Understanding, Generation and Reconstruction

Sinan Du, Jiahao Guo, Bo Li, Shuhao Cui, Zhengzhuo Xu, Yifu Luo, Yongxian Wei, Kun Gai, Xinggang Wang, Kai Wu, Chun Yuan

TL;DR

The paper introduces VQRAE, a unified tokenizer that simultaneously provides continuous semantic features for multimodal understanding and discrete tokens for generation and reconstruction. Built on pretrained Vision Foundation Models and a symmetric ViT decoder, it trains a high‑dimensional semantic codebook with a two‑stage process and self‑distillation to maintain understanding while enabling high‑quality image reconstruction and generation. By eliminating convolutional pixel encoders and enabling direct integration with existing MLLMs, VQRAE achieves competitive performance across understanding, generation, and reconstruction benchmarks and demonstrates strong scaling potential for autoregressive multimodal models.

Abstract

Unifying the representations for multimodal understanding, generation, and reconstruction in a single tokenizer remains a key challenge in building unified models. Previous research predominantly addresses this with a dual-encoder paradigm, e.g., using separate encoders for understanding and generation, or by balancing semantic representations and low-level features with a contrastive loss. In this paper, we propose VQRAE, a Vector Quantization version of Representation AutoEncoders, which pioneers a unified representation that produces continuous semantic features for image understanding and discrete tokens for visual generation within a single tokenizer. Specifically, we build upon pretrained vision foundation models with a symmetric ViT decoder and adopt a two-stage training strategy: the first stage freezes the encoder and learns a high-dimensional semantic VQ codebook with a pixel reconstruction objective; the second stage jointly optimizes the encoder under self-distillation constraints. This design incurs negligible loss of semantic information, preserving multimodal understanding ability, while yielding discrete tokens that are suitable for generation and fine-grained reconstruction. In addition, we identify an intriguing property of quantizing semantic encoders: they rely on a high-dimensional codebook, in contrast to the common practice of using low-dimensional codebooks in image reconstruction. The semantic VQ codebook can achieve a 100% utilization ratio at a dimension of 1536. VQRAE delivers competitive performance on several benchmarks of visual understanding, generation, and reconstruction, and its discrete tokens give it promising scaling behavior in the autoregressive paradigm.
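To make the abstract's two codebook terms concrete, here is a minimal sketch of a generic nearest-neighbor vector quantization lookup and of a codebook utilization ratio (the fraction of entries selected at least once). It is an illustration under assumptions, not the paper's implementation: the function names, the squared Euclidean distance, and the toy 2-dimensional codebook are all hypothetical, whereas the paper reports a codebook dimension of 1536.

```python
from typing import List, Sequence


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(features: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each continuous feature vector to the index of its nearest codebook entry.

    Assumption: nearest-neighbor lookup under squared Euclidean distance,
    the usual choice in VQ models; the paper's exact quantizer may differ.
    """
    return [
        min(range(len(codebook)), key=lambda k: squared_distance(f, codebook[k]))
        for f in features
    ]


def utilization_ratio(indices: Sequence[int], codebook_size: int) -> float:
    """Fraction of codebook entries that were selected at least once."""
    return len(set(indices)) / codebook_size


# Toy example (hypothetical values): 4 codebook entries, 4 feature vectors.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
features = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8], [1.1, 0.1]]

indices = quantize(features, codebook)           # discrete tokens
quantized = [codebook[i] for i in indices]       # quantized vectors passed to a decoder
ratio = utilization_ratio(indices, len(codebook))

print(indices)  # [0, 1, 2, 1]
print(ratio)    # 0.75
```

In this toy run the fourth entry is never selected, so utilization is 75%; a 100% ratio means every entry is selected by at least one feature.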

Paper Structure

This paper contains 28 sections, 6 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: Comparisons of different unified tokenizers. (a) The Janus series (Janus, Janus-Pro) adopts a dual-encoder paradigm. (b) QLIP and UniTok supervise discrete tokens with a CLIP loss. (c) Our VQRAE can produce continuous and discrete tokens for different tasks.
  • Figure 2: Illustration of our unified tokenizer VQRAE. (a) Our VQRAE is built on pretrained VFMs (e.g., SigLIP2), which can simultaneously produce continuous semantic features for multimodal understanding tasks and discrete tokens for visual generation and reconstruction tasks. (b) Training pipeline of VQRAE. We adopt a two-stage training paradigm. In the first stage, the pretrained semantic encoder remains frozen, while a high-dimensional vector quantization codebook and a pixel decoder are trained using an image reconstruction loss. In the second stage, the encoder, codebook, and decoder are jointly optimized to achieve fine-grained reconstruction. Additionally, the encoder outputs are regularized via a self-distillation loss to maintain semantic understanding performance. (c) VQRAE achieves a superior trade-off with the unified encoder in the autoregressive style.
  • Figure 3: We perform K-means clustering on the ImageNet-1K validation set using continuous features and discrete tokens. The visualization illustrates images grouped by (a) continuous features and (b) discrete tokens, both derived from our VQRAE. VQRAE is capable of producing discriminative features for multimodal understanding and discrete visual tokens for fine-grained reconstruction and generation simultaneously within a unified tokenizer. It indicates the redundancy in the dual-encoder paradigm.
  • Figure 4: Visualization of reconstruction results from VQRAE-InternViT version. Left: input image; Right: output image.
  • Figure 5: Visualization results of the ablation study on training strategies. As indicated in Tab. \ref{tab:abl_training}, the second training stage adds more fine-grained details to the reconstruction and retains semantics, while end-to-end training without distillation constraints fails to achieve a trade-off between them.
  • ...and 4 more figures
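
The two-stage pipeline described in the Figure 2 caption can be summarized as a small schedule of which modules are updated and which loss terms are active in each stage. The sketch below is an illustration under assumptions, not the authors' code: `StagePlan`, `stage_plan`, and `self_distillation_loss` are hypothetical names, only the losses named in the caption are listed, and the mean-squared-error form of the self-distillation term (tuned encoder features against features from a frozen copy of the original encoder) is an assumption, since the caption does not specify it.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class StagePlan:
    trainable: Dict[str, bool]  # module name -> receives gradient updates?
    losses: List[str]           # loss terms active in this stage


def stage_plan(stage: int) -> StagePlan:
    """Return the (hypothetical) training plan for stage 1 or stage 2."""
    if stage == 1:
        # Stage 1: encoder frozen; codebook and pixel decoder learn from reconstruction.
        return StagePlan(
            trainable={"encoder": False, "codebook": True, "decoder": True},
            losses=["reconstruction"],
        )
    if stage == 2:
        # Stage 2: all three modules are optimized jointly, with a
        # self-distillation term regularizing the encoder outputs.
        return StagePlan(
            trainable={"encoder": True, "codebook": True, "decoder": True},
            losses=["reconstruction", "self_distillation"],
        )
    raise ValueError(f"unknown stage: {stage}")


def self_distillation_loss(student: Sequence[Sequence[float]],
                           teacher: Sequence[Sequence[float]]) -> float:
    """Mean squared error between student and teacher features.

    Assumption: the teacher features come from a frozen copy of the
    pretrained encoder, and MSE is used as the distance.
    """
    total, count = 0.0, 0
    for s_vec, t_vec in zip(student, teacher):
        for s, t in zip(s_vec, t_vec):
            total += (s - t) ** 2
            count += 1
    return total / count


# Toy feature values (hypothetical): one of four components drifts by 2.0.
student_features = [[1.0, 2.0], [0.0, 0.0]]
teacher_features = [[1.0, 0.0], [0.0, 0.0]]
loss = self_distillation_loss(student_features, teacher_features)

print(stage_plan(1).trainable["encoder"])  # False
print(stage_plan(2).losses)                # ['reconstruction', 'self_distillation']
print(loss)                                # 1.0
```

Keeping the encoder frozen in stage 1 means the codebook and decoder are fit to features that are still the pretrained semantic features; stage 2 then lets the encoder move toward finer reconstruction while the distillation term penalizes drift from those features.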