Introducing Visual Perception Token into Multimodal Large Language Model

Runpeng Yu, Xinyin Ma, Xinchao Wang

TL;DR

This work introduces Visual Perception Tokens to give Multimodal Large Language Models (MLLMs) autonomous, token-driven control over visual perception. It defines two token types (the Region Selection Token for targeted cropping and the Vision Re-Encoding Token for re-encoding with an auxiliary vision encoder) and demonstrates how these tokens can be generated during next-token prediction to trigger additional perception steps. Trained on an 829k-sample dataset spanning OCR, spatial reasoning, and VQA tasks, the approach yields notable gains, with 2B models matching or exceeding 7B baselines and Free Choice prompting delivering further improvements in spatial and fine-grained understanding. The results, ablations on token granularity, and supplementary experiments support the efficacy and generalizability of token-based visual perception control in MLLMs, suggesting broad applicability to other prompting techniques and vision encoders.

Abstract

To utilize visual information, a Multimodal Large Language Model (MLLM) relies on the perception process of its vision encoder. The completeness and accuracy of visual perception significantly influence the precision of spatial reasoning, fine-grained understanding, and other tasks. However, MLLMs still lack the autonomous capability to control their own visual perception processes, for example, selectively reviewing specific regions of an image or focusing on information related to specific object categories. In this work, we propose the concept of the Visual Perception Token, aiming to empower MLLMs with a mechanism to control their visual perception processes. We design two types of Visual Perception Tokens, termed the Region Selection Token and the Vision Re-Encoding Token. MLLMs autonomously generate these tokens, just as they generate text, and use them to trigger additional visual perception actions. The Region Selection Token explicitly identifies specific regions in an image that require further perception, while the Vision Re-Encoding Token uses its hidden states as control signals to guide additional visual perception processes. Extensive experiments demonstrate the advantages of these tokens in handling spatial reasoning, improving fine-grained understanding, and other tasks. On average, the introduction of Visual Perception Tokens improves the performance of a 2B model by 23.6%, increasing its score from 0.572 to 0.708, and the resulting model even outperforms a 7B-parameter model (score 0.624) by 13.4%. Please check out our repo: https://github.com/yu-rp/VisualPerceptionToken

Paper Structure

This paper contains 24 sections, 1 equation, 7 figures, 8 tables.

Figures (7)

  • Figure 1: The relationship between the Region Selection Tokens we used and a precise bbox. Region Selection Token uses the cells containing the top-left and bottom-right corners to describe the approximate location of the region. In this example, the image is evenly divided into $4\times 4$ cells. In our main experiment, we divide images into $8\times 8$ cells.
  • Figure 2: In a standard MLLM generation process, the model directly outputs a response based on the input image and query. However, an MLLM equipped with Visual Perception Tokens can first generate special tokens that trigger additional perception processes before responding. If the MLLM outputs a Region Selection Token, the original image is cropped and reprocessed through the visual encoder. The MLLM then bases its answer on two sets of visual embeddings: the first set contains the global embeddings from the original image, and the second set contains the local embeddings from the cropped image. If the MLLM outputs a DINO Feature Token, the DINO features of the image are used to supplement the original CLIP-based features. In addition to the DINO features, the hidden state of the DINO Feature Token is also input to the projector as a condition to control which features are ultimately passed to the language model.
  • Figure 3: Examples collected from the testing sets. The responses were generated by the 7B model and the 2B+VPT model. If Region Selection Tokens were used during generation, the regions selected by these tokens are highlighted with red boxes in the images. For additional examples, please refer to the supplementary material.
  • Figure S1: This set of images demonstrates how the DINO Feature Token assists MLLMs in identifying specific objects within images. These objects are often difficult for MLLMs to recognize directly due to their small size or interference from surrounding objects.
  • Figure S2: This set of images illustrates how the DINO Feature Token assists MLLMs in counting the number of objects in an image. Counting has long been a significant limitation for MLLMs. By leveraging the DINO Feature, the DINO Feature Token enables precise localization of individual objects within the image, thereby improving the counting capability of MLLMs.
  • ...and 2 more figures
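
The grid-cell encoding described in the Figure 1 caption can be illustrated with a small sketch. This is not the authors' implementation: the function names `bbox_to_cells` and `cells_to_crop` are hypothetical, and both the boundary handling and the choice to expand the crop to the full extent of the two corner cells are assumptions made here for illustration. Only the idea of describing a region by the cells containing its top-left and bottom-right corners, and the $8\times 8$ grid, come from the caption.

```python
from typing import Tuple

Cell = Tuple[int, int]  # (row, col) index of a grid cell


def bbox_to_cells(bbox: Tuple[float, float, float, float],
                  width: int, height: int, k: int = 8) -> Tuple[Cell, Cell]:
    """Return the cells containing the top-left and bottom-right bbox corners.

    bbox is (x0, y0, x1, y1) in pixels; the image is split evenly into k x k cells.
    """
    x0, y0, x1, y1 = bbox

    def cell(x: float, y: float) -> Cell:
        # Clamp so that a corner lying exactly on the right/bottom image edge
        # still falls in the last cell (an assumption, not from the paper).
        col = min(int(x * k / width), k - 1)
        row = min(int(y * k / height), k - 1)
        return row, col

    return cell(x0, y0), cell(x1, y1)


def cells_to_crop(top_left: Cell, bottom_right: Cell,
                  width: int, height: int, k: int = 8
                  ) -> Tuple[float, float, float, float]:
    """Approximate crop covering both corner cells in full (assumed behavior)."""
    (r0, c0), (r1, c1) = top_left, bottom_right
    left = c0 * width / k
    top = r0 * height / k
    right = (c1 + 1) * width / k
    bottom = (r1 + 1) * height / k
    return left, top, right, bottom


if __name__ == "__main__":
    tl, br = bbox_to_cells((130, 250, 420, 590), width=800, height=800)
    print(tl, br)                            # (2, 1) (5, 4)
    print(cells_to_crop(tl, br, 800, 800))   # (100.0, 200.0, 500.0, 600.0)
```

Under these assumptions the recovered crop always contains the original box, at the cost of some extra margin; a finer grid (larger `k`) tightens the crop but increases the number of distinct cell indices the model has to predict.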