Towards Semantic Equivalence of Tokenization in Multimodal LLM

Shengqiong Wu, Hao Fei, Xiangtai Li, Jiayi Ji, Hanwang Zhang, Tat-Seng Chua, Shuicheng Yan

TL;DR

This paper addresses the misalignment between visual and textual semantics in multimodal LLMs by introducing SeTok, a dynamic, semantic-equivalent vision tokenizer that forms semantically complete tokens via density-peak clustering and a token-merger stage. Integrated with a pre-trained LLM as Setokim, the framework uses autoregressive training with a combined image reconstruction and concept-level image-text contrastive loss, enabling robust vision-language understanding and generation. Across tasks such as visual understanding, image generation/editing, and referring segmentation, SeTok/Setokim demonstrate superior semantic alignment and finer-grained capabilities compared to traditional patch-based or learnable-query tokenizations, supported by extensive ablations and qualitative analyses. The approach offers a scalable path to more interpretable and semantically coherent multimodal interactions, with practical impact on V&L tasks requiring precise token-level alignment and manipulation.

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in processing vision-language tasks. A crux of MLLMs lies in vision tokenization, which involves efficiently transforming input visual signals into the feature representations that are most beneficial for LLMs. However, existing vision tokenizers, which are essential for semantic alignment between vision and language, remain problematic: they aggressively fragment the visual input, corrupting its semantic integrity. To address this, this paper proposes a novel dynamic Semantic-Equivalent Vision Tokenizer (SeTok), which groups visual features into semantic units via a dynamic clustering algorithm, flexibly determining the number of tokens based on image complexity. The resulting vision tokens effectively preserve semantic integrity and capture both low-frequency and high-frequency visual features. The proposed MLLM (Setokim), equipped with SeTok, demonstrates significantly superior performance across various tasks, as evidenced by our experimental results. The project page is at https://chocowu.github.io/SeTok-web/.
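
To make the idea of dynamic, clustering-based tokenization more concrete, the following is a minimal sketch of generic density-peak clustering over patch features, in which the number of resulting tokens depends on the input rather than being fixed in advance. It is not the authors' implementation: the function names, the Gaussian density kernel, the parameters `d_c`, `rho_min`, and `delta_min`, and the use of mean pooling as a stand-in for the token-merger stage are all assumptions made for illustration.

```python
import numpy as np


def density_peak_cluster(features, d_c, delta_min, rho_min=0.0):
    """Group feature vectors around density peaks (illustrative sketch).

    features:  (n, d) array, e.g. patch embeddings from a vision encoder.
    d_c:       hypothetical kernel width for the local-density estimate.
    delta_min: hypothetical minimum distance to any denser point for a
               point to count as a cluster centre.
    rho_min:   hypothetical minimum local density for a cluster centre.
    Returns an (n,) integer array of cluster labels.
    """
    n = features.shape[0]
    # Pairwise Euclidean distances between all feature vectors.
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))

    # Local density rho: Gaussian-weighted count of neighbours (self excluded).
    kernel = np.exp(-((dist / d_c) ** 2))
    np.fill_diagonal(kernel, 0.0)
    rho = kernel.sum(axis=1)

    # delta: distance to the nearest point with higher density.
    order = np.argsort(-rho)            # indices sorted by descending density
    delta = np.zeros(n)
    parent = np.full(n, -1)             # nearest denser neighbour of each point
    delta[order[0]] = dist[order[0]].max()
    for rank in range(1, n):
        i = order[rank]
        higher = order[:rank]           # all points ranked denser than i
        j = higher[np.argmin(dist[i, higher])]
        delta[i] = dist[i, j]
        parent[i] = j

    # Centres are points that are both dense and far from any denser point.
    is_center = (rho >= rho_min) & (delta >= delta_min)
    is_center[order[0]] = True          # the global density peak is always a centre

    # Every other point inherits the label of its nearest denser neighbour.
    labels = np.full(n, -1)
    labels[is_center] = np.arange(is_center.sum())
    for i in order:                     # parents are always visited before children
        if labels[i] < 0:
            labels[i] = labels[parent[i]]
    return labels


def merge_tokens(features, labels):
    """Mean-pool the features of each cluster into one token (a stand-in
    for a learned token merger)."""
    num_clusters = labels.max() + 1
    return np.stack([features[labels == c].mean(axis=0) for c in range(num_clusters)])
```

Calling `merge_tokens(features, density_peak_cluster(features, d_c, delta_min))` yields one vector per discovered cluster, so an input with more well-separated feature groups produces more tokens. The thresholds here are hand-set; how the actual tokenizer selects centres and merges features is described in the paper itself.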

Paper Structure (49 sections, 5 equations, 11 figures, 14 tables, 1 algorithm)

This paper contains 49 sections, 5 equations, 11 figures, 14 tables, 1 algorithm.

Figures (11)

  • Figure 1: Comparison of how existing MLLMs tokenize visual inputs: (a) patch-level continuous tokens, (b) patch-level discrete tokens, (c) learnable query tokens, and (d) semantic-equivalent continuous tokens (ours). In (e), we show four language-driven vision tasks enhanced with semantic-equivalent vision tokens; in the token masks, regions of the same color represent a single vision token.
  • Figure 2: Overview of SeTok. SeTok tokenizes the visual features extracted from an image by a vision encoder into semantically equivalent vision tokens, which are then fed into a detokenizer to reconstruct the image and are also used for concept-level image-text alignment.
  • Figure 3: Overview of Setokim.
  • Figure 4: Qualitative results on image understanding and generation. The words marked in green are key elements in the questions and answers. Best viewed on screen.
  • Figure 5: Qualitative comparison between MLLMs on image editing. Setokim excels at adhering to instructions and preserving low-level image details.
  • ...and 6 more figures