Ming-UniVision: Joint Image Understanding and Generation with a Unified Continuous Tokenizer

Ziyuan Huang, DanDan Zheng, Cheng Zou, Rui Liu, Xiaolong Wang, Kaixiang Ji, Weilong Chai, Jianxin Sun, Libin Wang, Yongjie Lv, Taozhi Huang, Jiajia Liu, Qingpei Guo, Ming Yang, Jingdong Chen, Jun Zhou

TL;DR

Ming-UniVision introduces MingTok, a continuous visual tokenizer that eliminates vector quantization to unify visual understanding and generation within a single autoregressive framework. Built on MingTok, Ming-UniVision employs a unified input representation and next-token prediction to perform understanding, generation, and editing in a shared latent space, enabling efficient multi-round in-context interactions and reduced token counts. The approach achieves competitive multi-modal understanding and state-of-the-art generation on GenEval, with strong editing capabilities and high-fidelity reconstruction, while highlighting practical workflows such as iterative super-resolution and segmentation-guided edits. Limitations include the need for large-scale interleaved pretraining and further refinement of fine-grained editing, which the authors plan to address in future work. Overall, the work demonstrates the potential of a unified continuous visual representation to simplify architecture and enable versatile, interactive multimodal AI systems.

Abstract

Visual tokenization remains a core challenge in unifying visual understanding and generation within the autoregressive paradigm. Existing methods typically employ tokenizers in discrete latent spaces to align with the tokens from large language models, where the quantization errors can limit semantic expressiveness and degrade the capability of vision-language understanding. To address this, we introduce MingTok, a new family of visual tokenizers with a continuous latent space, for unified autoregressive generation and understanding. While understanding tasks favor discriminative high-dimensional features, generation tasks prefer compact low-level codes. Thus, to reconcile these competing demands, MingTok adopts a three-stage sequential architecture involving low-level encoding, semantic expansion, and visual reconstruction. Built on top of it, Ming-UniVision eliminates the need for task-specific visual representations, and unifies diverse vision-language tasks under a single autoregressive prediction paradigm. By formulating both understanding and generation as next-token prediction in a shared continuous space, it seamlessly supports multi-round, in-context tasks such as iterative understanding, generation, and editing. Empirically, we find that using a unified continuous visual representation reconciles the competing requirements placed on the tokenizers by the understanding and generation tasks, thereby leading to state-of-the-art level performance across both domains. We hope our findings will facilitate unified visual tokenization in the continuous domain. Inference code and model weights are released to benefit the community.
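
To make the three-stage flow described in the abstract concrete, the following is a minimal shape-bookkeeping sketch, not the released implementation. Every name and number in it (`Shape`, the function names, the 512×512 input, the patch size of 32, the latent width of 32, and the semantic width of 1024) is a hypothetical choice for illustration. It also assumes that semantic expansion keeps the token count fixed and only widens each token, which the text here does not state.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shape:
    """Token-sequence shape: number of tokens and width of each token."""
    tokens: int
    dim: int


def low_level_encode(height: int, width: int, patch: int, latent_dim: int) -> Shape:
    # Stage 1 (assumed): one compact, low-dimensional latent per image patch.
    return Shape((height // patch) * (width // patch), latent_dim)


def semantic_expand(latent: Shape, semantic_dim: int) -> Shape:
    # Stage 2 (assumed): same token count, wider features for understanding.
    return Shape(latent.tokens, semantic_dim)


def pixel_reconstruct(semantic: Shape, patch: int, channels: int = 3) -> Shape:
    # Stage 3 (assumed): each token decodes back to one patch of pixels.
    return Shape(semantic.tokens, patch * patch * channels)


# Hypothetical sizes, chosen only to show the compact-vs-wide contrast.
latent = low_level_encode(512, 512, patch=32, latent_dim=32)
semantic = semantic_expand(latent, semantic_dim=1024)
pixels = pixel_reconstruct(semantic, patch=32)
print(latent, semantic, pixels)
```

The point of the sketch is the contrast the abstract draws: the generation side works with the narrow `latent` tokens, while the understanding side sees the wider `semantic` tokens derived from the same sequence.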

Paper Structure

This paper contains 23 sections, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Conceptual comparison and qualitative examples of MingTok. (a) Existing models that use continuous latent spaces for unified visual understanding and generation rely on two sets of representations for visual content. (b) MingTok employs a unified tokenizer to generate both semantic and low-level image representations. (c) Compared with SD-VAE (sd_2022), MingTok achieves over 3.5 times acceleration for text-to-image generation.
  • Figure 2: The model architecture and the training objectives of MingTok. MingTok performs image compression, semantic decoding, and image reconstruction sequentially through a low-level encoder, a semantic decoder, and a pixel decoder. During training, both the image latent and the semantic features are supervised by pre-trained visual encoders with masked feature prediction, while the pixel decoder is trained by masked and unmasked image reconstruction.
  • Figure 3: The architecture of Ming-UniVision. Owing to the autoregressive semantic decoding capability of MingTok, both image understanding (image-to-text generation) and image synthesis (text-to-image generation) can be formulated consistently with the same next-token prediction paradigm and unified input representation space. This allows our unified multimodal model to support multi-round in-context tasks and to switch seamlessly from understanding to generation/editing tasks, and vice versa.
  • Figure 4: Comparison of input token structures across different unified model architectures. Ming-UniVision reduces the number of input visual tokens by 66% compared to hybrid AR-diffusion models (shi2024lmfusion; deng2025bagel) and by 50% compared to existing unified autoregressive models (fan2025unifluid), thanks to the unified representation enabled by MingTok.
  • Figure 5: Generation performance comparison during pre-training across different understanding (U) and generation (G) tokenizer combinations. Using MingTok as the generation representation (MingTok (G)) achieves the best performance in generation-only training, significantly outperforming VAE-based representations (VAE (G)). When MingTok is used for both roles (MingTok (G & U), unified setting), the performance gap between pure generation and unified training narrows notably, demonstrating the benefit of universal visual representations for joint vision-language modeling.
  • ...and 3 more figures
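
The token reductions quoted in the Figure 4 caption (66% and 50%) can be checked with a small worked calculation. The sketch below rests on an assumption that the caption does not spell out: that the hybrid AR-diffusion baselines feed three equal-length sets of visual tokens per image, the existing unified autoregressive baselines feed two, and Ming-UniVision feeds one. The function name `visual_token_reduction` is hypothetical.

```python
def visual_token_reduction(baseline_sets: int, unified_sets: int = 1) -> float:
    """Fractional drop in visual input tokens per image.

    Assumes every representation set contributes the same number of tokens,
    so only the count of sets matters.
    """
    if baseline_sets <= 0 or unified_sets <= 0:
        raise ValueError("set counts must be positive")
    return 1.0 - unified_sets / baseline_sets


# Assumed set counts: 3 for hybrid AR-diffusion, 2 for unified AR baselines.
vs_hybrid = visual_token_reduction(3)
vs_unified_ar = visual_token_reduction(2)

# Truncating to whole percent gives the figures quoted in the caption.
print(int(vs_hybrid * 100), int(vs_unified_ar * 100))
```

Under these assumed set counts the reductions are 1 − 1/3 ≈ 66.7% and 1 − 1/2 = 50%, which is consistent with the caption. The actual token layouts should be read from Figure 4 itself.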