ClawMachine: Learning to Fetch Visual Tokens for Referential Comprehension

Tianren Ma, Lingxi Xie, Yunjie Tian, Boyu Yang, Qixiang Ye

TL;DR

ClawMachine presents a unified, end-to-end multimodal model that notates visual entities as token collectives within a joint vision-language vocabulary, enabling native referential comprehension (referring and grounding) without extra syntax. It combines a hybrid perception of continuous and discrete visual signals, a V-L mounting operation to fuse tokens into language prompts, and a region sampler to convert token collectives into grounding boxes. Through dual data pretraining (scene-level, region-level, and interleaved GRIT-20M data) and two-stage training (alignment pre-training and instruction-tuning), ClawMachine achieves state-of-the-art or competitive results on visual referring and grounding benchmarks with higher efficiency and fewer hallucinations. The approach demonstrates that pure autoregressive models can surpass architectures with large modular components, offering scalable, integrated capabilities for complex visual reasoning and multi-object grounding in real-world tasks.

Abstract

Aligning vision and language concepts at a finer level remains an essential topic of multimodal large language models (MLLMs), particularly for tasks such as referring and grounding. Existing methods, such as proxy encoding and geometry encoding, incorporate additional syntax to encode spatial information, imposing extra burdens when communicating between language and vision modules. In this study, we propose ClawMachine, offering a new methodology that explicitly notates each entity using token collectives (groups of visual tokens that collaboratively represent higher-level semantics). A hybrid perception mechanism is also explored to perceive and understand scenes from both discrete and continuous spaces. Our method unifies the prompt and answer of visual referential tasks without using additional syntax. By leveraging a joint vision-language vocabulary, ClawMachine further integrates referring and grounding in an auto-regressive manner, demonstrating great potential with scaled-up pre-training data. Experiments show that ClawMachine achieves superior performance on scene-level and referential understanding tasks with higher efficiency. It also exhibits the potential to integrate multi-source information for complex visual reasoning, which is beyond the capability of many MLLMs. Our code is available at github.com/martian422/ClawMachine.
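
To make the idea of a token collective concrete, the following minimal Python sketch gathers the discrete visual tokens that cover a referred region. It is an illustration under stated assumptions, not the authors' implementation: the function name `fetch_token_collective`, the rule of selecting grid cells whose centers fall inside the box, and the toy 4x4 token grid are all hypothetical.

```python
from typing import List, Tuple

# (x0, y0, x1, y1), normalized to [0, 1]; this box convention is an assumption.
Box = Tuple[float, float, float, float]


def fetch_token_collective(token_grid: List[List[int]], box: Box) -> List[int]:
    """Collect token IDs of grid cells whose centers lie inside `box`.

    Hypothetical selection rule: the abstract only says that an entity is
    notated by a group of visual tokens, not how the group is chosen.
    Tokens are returned in raster (row-major) order.
    """
    rows, cols = len(token_grid), len(token_grid[0])
    x0, y0, x1, y1 = box
    collective = []
    for r in range(rows):
        for c in range(cols):
            # Center of cell (r, c) in normalized image coordinates.
            cx, cy = (c + 0.5) / cols, (r + 0.5) / rows
            if x0 <= cx <= x1 and y0 <= cy <= y1:
                collective.append(token_grid[r][c])
    return collective


# Toy 4x4 grid whose token IDs are simply 0..15 in raster order.
grid = [[r * 4 + c for c in range(4)] for r in range(4)]
# Refer to the top-right quadrant of the image.
region_tokens = fetch_token_collective(grid, (0.5, 0.0, 1.0, 0.5))
```

In this toy setup, `region_tokens` would be spliced into the text prompt in place of extra coordinate syntax, which is the role the abstract assigns to token collectives.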

Paper Structure (22 sections, 1 equation, 9 figures, 18 tables)

Figures (9)

  • Figure 1: A conceptual comparison between existing MLLMs and our ClawMachine in notating an object in the image. ClawMachine does not use extra syntax but directly embeds visual tokens into natural language, supporting fine-level visual understanding (e.g., referring and grounding) through a native mechanism.
  • Figure 2: Framework of ClawMachine. When an image (or a region) is referred to, the corresponding visual tokens are directly embedded into the natural language. ClawMachine performs next-token prediction, and the output visual tokens are projected back to the image for grounding. $B$ represents the batch size, while $dim_V$ and $dim_L$ denote the dimensions of the visual and language embeddings. Embed denotes the LLM's embedding layer and is shown separately for clarity.
  • Figure 3: ClawMachine generates visual tokens, projects them to the image lattice (denoted by stars), and predicts the grounded box (denoted by rectangles). The top row shows the ability to ground different objects within one image. The bottom row shows visual tokens with the same ID.
  • Figure 4: ClawMachine can solve complex visual reasoning tasks. See the text for explanations.
  • Figure 5: The model's detailed grounding performance with controlled IoU thresholds and object sizes. Note: small (23.6%): $S\in(0, 0.08)$; medium (49.1%): $S\in(0.08, 0.25)$; large (27.3%): $S\in(0.25, 1)$.
  • ...and 4 more figures
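
The grounding direction described in Figures 3 and 5 (visual tokens projected to the image lattice, a box predicted from them, then scored by IoU and object size) can be sketched as follows. This is a rough illustration only: enclosing the lattice cells with a min/max box stands in for the paper's region sampler, whose actual design is not given here, and reading $S$ as the box area relative to the image is an assumption. All function names are hypothetical.

```python
from typing import Iterable, Tuple

# (x0, y0, x1, y1), normalized to [0, 1]; this box convention is an assumption.
Box = Tuple[float, float, float, float]


def box_from_lattice_points(points: Iterable[Tuple[int, int]],
                            rows: int, cols: int) -> Box:
    """Smallest box enclosing the lattice cells given as (row, col) pairs.

    A naive stand-in for a region sampler, not the authors' method.
    """
    pts = list(points)
    if not pts:
        raise ValueError("at least one lattice point is required")
    rs = [r for r, _ in pts]
    cs = [c for _, c in pts]
    return (min(cs) / cols, min(rs) / rows,
            (max(cs) + 1) / cols, (max(rs) + 1) / rows)


def area(b: Box) -> float:
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def size_bucket(b: Box) -> str:
    """Bucket by relative area S, using the thresholds from Figure 5.

    Assumption: S is the box area as a fraction of the image area.
    """
    s = area(b)
    if s < 0.08:
        return "small"
    if s < 0.25:
        return "medium"
    return "large"


# Three predicted token positions on a 4x4 lattice.
pred = box_from_lattice_points([(1, 1), (1, 2), (2, 1)], rows=4, cols=4)
gt = (0.25, 0.25, 0.75, 1.0)   # hypothetical ground-truth box
score = iou(pred, gt)
hit = score >= 0.5             # a commonly used IoU threshold
```

Sweeping the threshold in `hit` and grouping samples by `size_bucket` gives the kind of breakdown that Figure 5 reports.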