MolmoPoint: Better Pointing for VLMs with Grounding Tokens

Christopher Clark, Yue Yang, Jae Sung Park, Zixian Ma, Jieyu Zhang, Rohun Tripathi, Mohammadreza Salehi, Sangho Lee, Taira Anderson, Winson Han, Ranjay Krishna

Abstract

Grounding has become a fundamental capability of vision-language models (VLMs). Most existing VLMs point by generating coordinates as part of their text output, which requires learning a complicated coordinate system and results in a high token count. Instead, we propose a more intuitive pointing mechanism that directly selects the visual tokens that contain the target concept. Our model generates a special pointing token that cross-attends to the input image or video tokens and selects the appropriate one. To make this model more fine-grained, we follow these pointing tokens with an additional special token that selects a fine-grained subpatch within the initially selected region, and then a third token that specifies a location within that subpatch. We further show that performance improves by generating points sequentially in a consistent order, encoding the relative position of the previously selected point, and including a special no-more-points class when selecting visual tokens. Using this method, we set a new state-of-the-art on image pointing (70.7% on PointBench), set a new state-of-the-art among fully open models on GUI pointing (61.1% on ScreenSpotPro), and improve video pointing (59.1% human preference win rate vs. a text coordinate baseline) and tracking (+6.3% gain on Molmo2Track). We additionally show that our method achieves much higher sample efficiency and discuss the qualitative differences that emerge from this design change.
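
The pointing head described above can be sketched in a few lines. The snippet below is a minimal illustration in PyTorch, not the released implementation: all layer choices, dimensions, and variable names are assumptions, and only the overall structure follows the paper, with coarse-patch keys built from the image tokens, subpatch keys from ViT patch features, queries from the <PATCH> and <SUBPATCH> hidden states, an extra no-more-points class, and a <LOCATION> head that places a point inside the selected subpatch (see Figure 2). Sequential point generation and the relative-position encoding of the previously selected point are omitted for brevity.

```python
# Minimal sketch of a grounding-token pointing head (assumptions throughout;
# not the authors' code). Queries come from the LLM hidden states of the
# special tokens; keys come from the LLM-side image tokens (coarse) and from
# raw ViT patch features (fine), as sketched in Figure 2.
import torch
import torch.nn as nn


class GroundingTokenPointer(nn.Module):
    def __init__(self, llm_dim: int, vit_dim: int, proj_dim: int = 512):
        super().__init__()
        self.patch_query = nn.Linear(llm_dim, proj_dim)
        self.subpatch_query = nn.Linear(llm_dim, proj_dim)
        self.patch_key = nn.Linear(llm_dim, proj_dim)
        self.subpatch_key = nn.Linear(vit_dim, proj_dim)
        # Extra learned key acting as the "no-more-points" class.
        self.no_more_points = nn.Parameter(torch.zeros(proj_dim))
        # <LOCATION> head: regress an (x, y) offset inside the chosen subpatch.
        self.location_head = nn.Sequential(
            nn.Linear(llm_dim, proj_dim), nn.GELU(), nn.Linear(proj_dim, 2)
        )

    def forward(self, patch_hidden, subpatch_hidden, location_hidden,
                image_tokens, vit_subpatch_feats):
        # patch_hidden / subpatch_hidden / location_hidden: [llm_dim]
        # image_tokens: [num_patches, llm_dim] LLM-side visual tokens
        # vit_subpatch_feats: [num_patches, subpatches_per_patch, vit_dim]
        q_patch = self.patch_query(patch_hidden)                     # [proj]
        keys = self.patch_key(image_tokens)                          # [P, proj]
        keys = torch.cat([keys, self.no_more_points[None]], dim=0)   # [P+1, proj]
        patch_logits = keys @ q_patch                                # [P+1]
        patch_idx = patch_logits.argmax()

        if patch_idx == image_tokens.shape[0]:
            return None  # "no more points": stop emitting points

        q_sub = self.subpatch_query(subpatch_hidden)                 # [proj]
        sub_keys = self.subpatch_key(vit_subpatch_feats[patch_idx])  # [S, proj]
        sub_idx = (sub_keys @ q_sub).argmax()

        # Offset in [0, 1]^2 within the selected subpatch; mapping this back
        # to absolute image coordinates depends on the patch/subpatch grid.
        offset = torch.sigmoid(self.location_head(location_hidden))
        return patch_idx.item(), sub_idx.item(), offset
```

Selecting visual tokens directly in this way sidesteps learning a text coordinate system and keeps the number of generated tokens per point small, which is the motivation stated in the abstract.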

Paper Structure

This paper contains 43 sections, 6 equations, 8 figures, and 13 tables.

Figures (8)

  • Figure 1: Overview of MolmoPoint. To point, our model scores coarse-grained image patches using the LLM's hidden states, then scores fine-grained subpatches from the highest scoring patch using ViT image features, and then selects a point within the highest scoring subpatch.
  • Figure 2: Pointing with grounding tokens. Keys are built from image tokens and ViT patch features, and queries are built from the <PATCH> token and <SUBPATCH> token hidden states, to score patches and subpatches. The <LOCATION> token predicts the final output points within the highest scoring subpatch.
  • Figure 3: Overview of the generation of MolmoPoint-GUISyn. We prompt an LLM to generate the HTML for the screenshot and extract all bounding boxes of its UI elements. Then we use LLMs to annotate each bounding box with its interaction intents (a rough code sketch of this pipeline follows the figure list).
  • Figure 4: MolmoPoint-TrackAny: human-annotated point-to-track extension. Annotators are given a text query and an object of interest, and provide point tracks while marking frames as occluded when the object is not visible.
  • Figure 5: Sample efficiency. Left: Performance when using a very limited number of pointing training examples. Right: Pointing performance during full-scale pre-training.
  • ...and 3 more figures
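
As a rough illustration of the Figure 3 pipeline, the sketch below wires the steps together under stated assumptions: the two LLM calls are left as hypothetical placeholders (generate_screenshot_html and describe_interaction_intents are not from the paper), and Playwright is an assumed choice for rendering the generated HTML and reading out element bounding boxes, since the paper does not specify the tooling.

```python
# Rough sketch of the MolmoPoint-GUISyn generation steps from Figure 3.
# The paper specifies only the high-level recipe: LLM-generated HTML,
# bounding boxes for its UI elements, and LLM-written interaction intents
# per element. Everything below is an illustrative assumption.
from playwright.sync_api import sync_playwright


def generate_screenshot_html(prompt: str) -> str:
    """Placeholder for an LLM call that returns a full HTML page."""
    raise NotImplementedError


def describe_interaction_intents(element_html: str) -> list[str]:
    """Placeholder for an LLM call that writes interaction intents for one element."""
    raise NotImplementedError


def build_guisyn_example(prompt: str):
    html = generate_screenshot_html(prompt)
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 800})
        page.set_content(html)
        # Collect bounding boxes of interactive UI elements in the rendered page.
        elements = page.evaluate(
            """() => Array.from(
                   document.querySelectorAll('a, button, input, select, textarea')
               ).map(el => {
                   const r = el.getBoundingClientRect();
                   return {html: el.outerHTML,
                           box: [r.x, r.y, r.x + r.width, r.y + r.height]};
               })"""
        )
        page.screenshot(path="screenshot.png")
        browser.close()
    # Annotate each box with the intents a user might have when interacting with it.
    return [
        {"box": el["box"], "intents": describe_interaction_intents(el["html"])}
        for el in elements
    ]
```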