MergeVQ: A Unified Framework for Visual Generation and Representation with Disentangled Token Merging and Quantization

Siyuan Li, Luyuan Zhang, Zedong Wang, Juanxi Tian, Cheng Tan, Zicheng Liu, Chang Yu, Qingsong Xie, Haonan Lu, Haoqian Wang, Zhen Lei

TL;DR

MergeVQ presents a unified, VQ-based framework that bridges visual representation learning and autoregressive image generation by disentangling coarse semantics from latent space via token merging and recovering fine-grained details using a source matrix. It introduces Look-up Free Quantization and a Source Recovery module, enabling reconstruction from compressed semantic tokens and global semantic alignment, while offering two generation modes: raster-order MergeAR with KV-cache compression and randomized AR with Source Recovery. Across ImageNet-1K experiments, MergeVQ variants demonstrate competitive pre-training performance with compact token budgets and strong generation metrics, especially when combined with CFG and RandAR strategies. The approach provides a practical path to jointly optimize discriminative and generative capabilities in a single, efficient architecture.

Abstract

Masked Image Modeling (MIM) with Vector Quantization (VQ) has achieved great success in both self-supervised pre-training and image generation. However, most existing methods struggle with the trade-off a shared latent space imposes between generation quality on one side and representation learning and efficiency on the other. To push the limits of this paradigm, we propose MergeVQ, which incorporates token merging techniques into VQ-based generative models to bridge the gap between image generation and visual representation learning in a unified architecture. During pre-training, MergeVQ decouples top-k semantics from the latent space with a token merge module placed after the self-attention blocks in the encoder, feeding the subsequent Look-up Free Quantization (LFQ) and global alignment, and recovers fine-grained details through cross-attention in the decoder for reconstruction. For second-stage generation, we introduce MergeAR, which performs KV Cache compression for efficient raster-order prediction. Extensive experiments on ImageNet verify that MergeVQ, as an AR generative model, achieves competitive performance in both visual representation learning and image generation tasks while maintaining favorable token efficiency and inference speed. The code and model will be available at https://apexgen-x.github.io/MergeVQ.
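
To make the quantization step concrete, here is a minimal sketch of Look-up Free Quantization, assuming the sign-based formulation commonly used for LFQ: each latent channel is binarized, and the resulting bit pattern is read directly as the token index, so no codebook search is needed. The function name `lfq_quantize` and the toy latents are illustrative assumptions, not taken from the MergeVQ code, and training details (straight-through gradients, auxiliary losses) are omitted.

```python
import numpy as np

def lfq_quantize(z):
    """Quantize latents of shape (num_tokens, num_bits) without a codebook lookup.

    Hypothetical helper for illustration only; not the MergeVQ implementation.
    """
    bits = z > 0                              # one binary decision per channel
    codes = np.where(bits, 1.0, -1.0)         # quantized vectors in {-1, +1}^num_bits
    weights = 2 ** np.arange(z.shape[-1])     # channel i contributes 2**i to the index
    indices = (bits * weights).sum(axis=-1)   # integer token id per latent
    return codes, indices

# Toy latents: 2 tokens, 3 channels, so the implicit codebook has 2**3 = 8 entries.
z = np.array([[0.3, -1.2, 0.8],
              [-0.5, 0.1, -0.2]])
codes, indices = lfq_quantize(z)
```

With `num_bits` channels the implicit vocabulary has `2 ** num_bits` entries, which is why this style of quantizer scales to large vocabularies without storing or searching an embedding table.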

Paper Structure

This paper contains 27 sections, 16 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: MergeVQ learning paradigms. (a) The MergeVQ Tokenizer extracts $K$ semantic tokens with decoupled positional information (retained in the source matrix) using ToMe \cite{iclr2022ToMe} while quantizing spatial details with LFQ \cite{iclr2024FSQ, ICLR2024magvit2}; both are recovered and reconstructed correspondingly. (b) MergeVQ with a random-order Generator \cite{cvpr2025RandAR} generates $K$ discrete tokens with associated position instructions, while the trained Source Prediction module and decoder restore positional details. (c) The MergeAR Generator predicts $L$ tokens efficiently in raster order, with tailored KV Cache compression that removes redundancy within Next-token Prediction (NTP) \cite{NIPS2024LLaMAGen}.
  • Figure 2: Overview of the MergeVQ framework, which contains two stages and three groups of subtasks (Sec. \ref{sec:mergevq_framework}). (a) For representation learning (Sec. \ref{sec:mergevq_rec+rep}), $K$ semantic tokens are extracted by the encoder with self-attention and token merging \cite{iclr2022ToMe}; they can be aligned globally with a pre-trained teacher while learning contextual information by predicting the source matrix. (b) For reconstruction (Sec. \ref{sec:mergevq_rec}), taking the $K$ merged and quantized tokens as input, the Source Recovery module retains the positional information, and high-quality details are then reconstructed. (c) For generation (Sec. \ref{sec:generation}), we use the source matrix to construct a causal mask for training and leverage the KV cache to prune repeated tokens during inference for efficient generation.
  • Figure 3: Analysis of kept tokens in reconstruction and representation learning. Three MergeVQ tokenizers are trained at $128$ resolution for 30 epochs on ImageNet-1K. They keep 256, 144, and 36 tokens with ToMe \cite{iclr2022ToMe} in the encoder during training. At inference, we evaluate rFID and linear probing top-1 accuracy under diverse merge ratios to show the trade-off between generation and representation. See Sec. \ref{sec:exp} and Appendix \ref{app:result} for details.
  • Figure 4: Visualization of MergeVQ (G+R) reconstruction. With the number of kept tokens varying from 64 to 256, clustering maps of ToMe Attention indicate that MergeVQ can extract discriminative semantic tokens while recovering contextual positions and details.
  • Figure 5: Distribution of merge-ratio sampling in training. (a) With 256 tokens in total, MergeVQ (R) and (G+R) sample square numbers as the kept-token count in $[36, 100]$ and $[121, 225]$ with exponential and Gaussian distributions for stage-1 training, while the G+R version samples from $[144, 256]$ for stage-2 training. (b) With 1024 tokens in total, MergeVQ (G) samples square kept-token counts in $[225,400]$ and $[256,1024]$ with Gaussian and exponential distributions in both stage-1 and stage-2 training.
  • ...and 3 more figures
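
The captions for Figures 1 and 2 describe merging $N$ tokens into $K$ semantic tokens while a source matrix records where each original token went. A minimal sketch of that bookkeeping follows, under loud simplifying assumptions: the group assignment is given by hand rather than computed by ToMe's similarity matching, and recovery simply broadcasts each merged token back to its original positions instead of using the learned, cross-attention-based Source Recovery module. The names `merge_tokens` and `recover_tokens` are hypothetical.

```python
import numpy as np

def merge_tokens(tokens, assignment, num_kept):
    """Average N tokens into K merged tokens and return the K x N source matrix.

    `assignment[n]` is the merged-token index that original token n joins.
    In ToMe this comes from similarity matching; here it is supplied directly.
    """
    num_tokens = tokens.shape[0]
    source = np.zeros((num_kept, num_tokens))
    source[assignment, np.arange(num_tokens)] = 1.0    # S[k, n] = 1 if token n joins group k
    counts = source.sum(axis=1, keepdims=True)          # tokens per group
    merged = (source @ tokens) / np.maximum(counts, 1.0)
    return merged, source

def recover_tokens(merged, source):
    """Place each merged token back at every position it absorbed (N x D output)."""
    return source.T @ merged

# Toy example: N = 4 tokens of dimension 2, merged into K = 2 groups.
tokens = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]])
assignment = np.array([0, 0, 1, 1])
merged, source = merge_tokens(tokens, assignment, num_kept=2)
recovered = recover_tokens(merged, source)
```

The recovered sequence has the original length and layout but only the averaged content of each group, which illustrates why a decoder still has to restore fine-grained details after the positions are put back.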