
Wave-Particle (Continuous-Discrete) Dualistic Visual Tokenization for Unified Understanding and Generation

Yizhu Chen, Chen Ju, Zhicheng Wang, Shuai Xiao, Xu Chen, Jinsong Lan, Xiaoyong Zhu, Ying Chen

TL;DR

This work addresses the challenge of unifying understanding and generation in multimodal large language models by moving beyond a binary continuous versus discrete visual tokenizer. It introduces the continuous–discrete dualistic visual tokenizer (CDD-VT), which uses Diverse Quantitative Primitives (DQP) to diversify a multi-sub-codebook vocabulary and Dynamic Primitive Allocator (DPA) to adaptively allocate primitives per image patch, yielding discrete-like efficiency for simple regions and continuous-like fidelity for complex regions. The model integrates vision and language via a multimodal autoregression framework and demonstrates strong reconstruction quality, competitive zero-shot image-text retrieval, and robust classification performance, all while maintaining a concise, scalable architecture. These results suggest that adaptive granularity tokenization can closely approach the benefits of continuous tokenizers while preserving the simplicity of discrete tokenizers, enabling more efficient and unified multimodal understanding-generation pipelines.

Abstract

Unifying understanding and generation within a single multimodal large language model (MLLM) remains a significant challenge, largely because of the dichotomy between continuous and discrete visual tokenization. Continuous tokenizers (CT) achieve strong performance by bridging multiple independently trained understanding and generation modules, but suffer from complex multi-stage pipelines and substantial engineering overhead. Conversely, discrete tokenizers (DT) offer a conceptually elegant alternative by quantizing each image into a primitive, but inevitably incur information loss and performance degradation. To resolve this tension, we question the binary choice between CT and DT and, inspired by the wave-particle duality of light, propose the Continuous-Discrete Dualistic Visual Tokenizer (CDD-VT). We treat visual data as a flexible composition of image primitives drawn from quantized codebooks, with the crucial insight that the number of primitives assigned to each visual sample is determined adaptively by its complexity: simple instances use a few primitives, emulating discrete tokenization, while complex instances use many, approximating continuous tokenization. CDD-VT has two core components: Diverse Quantitative Primitives (DQP), which encourages orthogonality among primitives so that they better populate the information space, and Dynamic Primitive Allocator (DPA), which assesses sample complexity to determine the optimal set of primitives. Extensive experiments on reconstruction, retrieval, and classification show that CDD-VT achieves superior performance over specialized CT and DT, delivering strong results within a concise and scalable MLLM.
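The adaptive allocation idea can be illustrated with a small sketch. This is not the paper's DPA: it assumes a greedy residual quantization over a list of sub-codebooks with an error threshold as the stopping rule, and the names `allocate_primitives`, `nearest`, `tol`, and `SUB_CODEBOOKS` are hypothetical. It only shows how a simple input can end up with one primitive while a more complex input receives several.

```python
import math


def nearest(codebook, vec):
    """Return the codebook entry closest to vec (squared Euclidean distance)."""
    return min(codebook, key=lambda c: sum((a - b) ** 2 for a, b in zip(c, vec)))


def allocate_primitives(patch, sub_codebooks, tol=0.1):
    """Greedy residual quantization with early stopping (illustrative assumption).

    Each sub-codebook contributes at most one primitive. Allocation stops as
    soon as the residual norm is at most `tol`, so simple patches consume few
    primitives and complex patches consume many.
    """
    residual = list(patch)
    chosen = []
    for codebook in sub_codebooks:
        if math.sqrt(sum(r * r for r in residual)) <= tol:
            break
        primitive = nearest(codebook, residual)
        chosen.append(primitive)
        residual = [r - p for r, p in zip(residual, primitive)]
    return chosen, residual


# Toy 2-D sub-codebooks with progressively finer entries (made-up values).
SUB_CODEBOOKS = [
    [(1.0, 0.0), (0.0, 1.0)],
    [(0.5, 0.0), (0.0, 0.5)],
    [(0.25, 0.0), (0.0, 0.25)],
]

simple, _ = allocate_primitives((1.0, 0.0), SUB_CODEBOOKS)
detailed, _ = allocate_primitives((1.0, 0.75), SUB_CODEBOOKS)
print(len(simple), len(detailed))
```

In this toy setup the first input is matched exactly by a single coarse primitive, whereas the second needs one primitive from every sub-codebook before the residual falls below the threshold.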

Paper Structure

This paper contains 12 sections, 4 equations, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Tokenization Comparisons. Continuous tokenizers (CT) deliver strong performance but require a complex, multi-stage workflow. Discrete tokenizers (DT) offer a concise, unified workflow but yield poor results. CDD-VT, as a continuous-discrete dualistic quantization, achieves strong results while unifying understanding and generation.
  • Figure 2: Framework Overview. Our continuous–discrete dualistic visual tokenizer (CDD-VT) consists of an image encoder, a text encoder, an image decoder, and two core components: Diverse Quantitative Primitives (DQP) and Dynamic Primitive Allocator (DPA). DQP encourages orthogonality among primitives so that they better populate the information space, while DPA assesses sample complexity to determine the optimal primitive set for each patch. For understanding, we compute the cosine similarity between text embeddings and quantized image embeddings. For generation, we feed text through a pipeline of encoder, LLM, and vision decoder.
  • Figure 3: Comparison of Image Reconstruction between UniTok [ma2025unitok], QLIP [zhao2025qlip], TokenFlow [Tokenflow_2025_CVPR], and our CDD-VT. FID and PSNR are evaluated on the ImageNet 50k validation set. CDD-VT shows superior detail preservation and perceptual fidelity over the competitors; for example, the text on the blackboard (Row 1) and on the map (Row 2) is notably clearer and more faithful to the inputs. Additional results can be found in the appendix.
  • Figure 4: Reconstruction Error during Training.
  • Figure 5: DPA Effectiveness. The Top-$K$ baseline (here $K=1$) assigns a fixed number of primitives to each patch, resulting in a uniform grid overlay. CDD-VT equipped with DPA produces an adaptive heatmap, concentrating more primitives on areas with complex information. Color bars denote the primitive count.
  • ...and 1 more figure
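
The Figure 2 caption mentions two operations that both reduce to cosine similarity: scoring text embeddings against quantized image embeddings for understanding, and encouraging orthogonality among primitives. A minimal sketch follows, assuming plain Python lists as embeddings. The penalty shown (mean squared pairwise cosine) is one common way to measure non-orthogonality and is an assumption here, not the paper's stated loss; the function names `cosine`, `orthogonality_penalty`, and `retrieve` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def orthogonality_penalty(primitives):
    """Mean squared cosine over all distinct pairs; 0.0 means mutually orthogonal.

    Assumed form of a diversity measure, not taken from the paper.
    """
    pairs = [
        (i, j)
        for i in range(len(primitives))
        for j in range(i + 1, len(primitives))
    ]
    if not pairs:
        return 0.0
    return sum(cosine(primitives[i], primitives[j]) ** 2 for i, j in pairs) / len(pairs)


def retrieve(image_embedding, text_embeddings):
    """Return the index of the text embedding most similar to the image embedding."""
    scores = [cosine(image_embedding, t) for t in text_embeddings]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy values for illustration only.
print(orthogonality_penalty([(1.0, 0.0), (0.0, 1.0)]))
print(retrieve((1.0, 0.2), [(0.0, 1.0), (1.0, 0.0)]))
```

A penalty of zero indicates that no two primitives overlap in direction, which is the property the caption attributes to DQP; the retrieval helper simply picks the caption whose embedding has the highest cosine score.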