Wave-Particle (Continuous-Discrete) Dualistic Visual Tokenization for Unified Understanding and Generation

Yizhu Chen; Chen Ju; Zhicheng Wang; Shuai Xiao; Xu Chen; Jinsong Lan; Xiaoyong Zhu; Ying Chen

TL;DR

This work addresses the challenge of unifying understanding and generation in multimodal large language models by moving beyond the binary choice between continuous and discrete visual tokenizers. It introduces the Continuous-Discrete Dualistic Visual Tokenizer (CDD-VT), which uses Diverse Quantitative Primitives (DQP) to diversify a multi-sub-codebook vocabulary and a Dynamic Primitive Allocator (DPA) to adaptively allocate primitives per image patch, yielding discrete-like efficiency for simple regions and continuous-like fidelity for complex regions. The model integrates vision and language via a multimodal autoregression framework and demonstrates strong reconstruction quality, competitive zero-shot image-text retrieval, and robust classification performance, while maintaining a concise, scalable architecture. These results suggest that adaptive-granularity tokenization can closely approach the benefits of continuous tokenizers while preserving the simplicity of discrete tokenizers, enabling more efficient and unified multimodal understanding-generation pipelines.

Abstract

The unification of understanding and generation within a single multi-modal large model (MLLM) remains a significant challenge, largely due to the dichotomy between continuous and discrete visual tokenizations. Continuous tokenizers (CT) achieve strong performance by bridging multiple independently trained understanding and generation modules, but suffer from complex multi-stage pipelines and substantial engineering overhead. Conversely, discrete tokenizers (DT) offer a conceptually elegant idea by quantizing each image into a primitive, but inevitably lead to information loss and performance degradation. To resolve this tension, we question the binary choice between CT and DT and, inspired by the wave-particle duality of light, propose the Continuous-Discrete Dualistic Visual Tokenizer (CDD-VT). We treat visual data as a flexible composition of image primitives derived from quantized codebooks, with the crucial insight that the number of primitives assigned to each visual sample is adaptively determined according to its complexity: simple instances use a few primitives, emulating discrete tokenization, while complex instances use many, approximating continuous tokenization. Two core components are designed: Diverse Quantitative Primitives, which encourage primitive orthogonality to better populate the information space, and Dynamic Primitive Allocator, which assesses sample complexity to determine the optimal set of primitives. Extensive experiments on reconstruction, retrieval, and classification show that CDD-VT achieves superior performance over specialized CT and DT, obtaining strong results within a concise and scalable MLLM.
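To make the adaptive-allocation idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a greedy residual scheme over several sub-codebooks, where a patch stops drawing primitives once its residual norm falls below a threshold. The function names `nearest` and `allocate_primitives`, the `tol` parameter, and the residual-norm stopping rule are all hypothetical stand-ins; the abstract does not specify how the Dynamic Primitive Allocator measures complexity.

```python
from math import sqrt

def nearest(vec, codebook):
    """Return the codebook entry closest to vec (squared Euclidean distance)."""
    return min(codebook, key=lambda c: sum((v - x) ** 2 for v, x in zip(vec, c)))

def allocate_primitives(patch, codebooks, tol=0.1):
    """Greedily compose a patch from primitives, one sub-codebook at a time.

    Assumed stopping rule: stop once the residual norm is at most tol.
    Returns the reconstruction and the number of primitives used.
    """
    residual = list(patch)
    recon = [0.0] * len(patch)
    used = 0
    for codebook in codebooks:
        if sqrt(sum(r * r for r in residual)) <= tol:
            break  # patch is already well approximated
        code = nearest(residual, codebook)
        recon = [a + c for a, c in zip(recon, code)]
        residual = [r - c for r, c in zip(residual, code)]
        used += 1
    return recon, used

# Two toy sub-codebooks: a coarse one and a finer one (illustrative values).
codebooks = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.0], [0.0, 0.5]]]
simple_recon, simple_used = allocate_primitives([1.0, 0.0], codebooks)
complex_recon, complex_used = allocate_primitives([1.0, 0.5], codebooks)
print(simple_used, complex_used)  # prints "1 2"
```

In this toy setting the simple patch is matched by a single primitive, which behaves like discrete tokenization, while the more complex patch draws on a second sub-codebook to reduce its residual, moving toward a continuous-like representation.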
