Point-It-Out: Benchmarking Embodied Reasoning for Vision Language Models in Multi-Stage Visual Grounding

Haotian Xue, Yunhao Ge, Yu Zeng, Zhaoshuo Li, Ming-Yu Liu, Yongxin Chen, Jiaojiao Fan

TL;DR

The paper identifies a gap in evaluating embodied reasoning in vision-language models due to reliance on indirect or language-only benchmarks. It introduces Point-It-Out (PIO), a three-stage, pixel-grounding benchmark that localizes referred objects (S1), grounds task-relevant elements (S2), and predicts visual traces for planning (S3) across indoor, kitchen, driving, and robotics domains. Through evaluation of over ten VLMs, the work shows that grounding-supervised models excel at S1/S2 while large generalist models can struggle with precise grounding, and that successful S3 trajectory prediction requires integrated grounding and planning capabilities. The study provides comprehensive data, evaluation metrics, and prompts, revealing clear gaps and guiding principles for building grounding-aware embodied agents with real-world impact.

Abstract

Vision-Language Models (VLMs) have demonstrated impressive world knowledge across a wide range of tasks, making them promising candidates for embodied reasoning applications. However, existing benchmarks primarily evaluate the embodied reasoning ability of VLMs through multiple-choice questions based on image annotations -- for example, selecting which trajectory better describes an event in the image. In this work, we introduce the Point-It-Out (PIO) benchmark, a novel benchmark designed to systematically assess the embodied reasoning abilities of VLMs through precise visual grounding. We propose a hierarchical evaluation protocol spanning three stages (S1: referred-object localization, S2: task-driven pointing, and S3: visual trace prediction), with data collected from critical domains for embodied intelligence, including indoor, kitchen, driving, and robotic manipulation scenarios. Extensive experiments with over ten state-of-the-art VLMs reveal several interesting findings. For example, strong general-purpose models such as GPT-4o, while excelling on many benchmarks (e.g., language, perception, and reasoning), underperform some open-source models in precise visual grounding; models such as MoLMO perform well in S1 and S2 but struggle in S3, which requires grounding combined with visual trace planning.

Paper Structure

This paper contains 19 sections, 11 figures, 5 tables.

Figures (11)

  • Figure 1: Unlike prior benchmarks that rely on indirect evaluation (a), Point-It-Out (PIO) directly assesses embodied reasoning (ER) by prompting VLMs to generate precise visual groundings—such as points, bounding boxes, or trajectories—in a hierarchical manner as shown in (b). To our knowledge, PIO is the first benchmark to offer pixel-level grounding for ER, spanning diverse embodied tasks across multiple real-world scenarios.
  • Figure 2: A Hierarchical Framework for Visual Grounding in Embodied Reasoning. We propose a three-stage progression: S1 (object localization) localizes objects explicitly referred to in the text, subject to conditions such as granularity and appearance; S2 (task-driven grounding) builds on S1 to infer locations relevant to a specific task, which may not be explicitly referred to in the text; and S3 (visual trace prediction) combines S1 and S2 to generate executable motion plans. Underlined text denotes the referred object that needs to be localized (S1), while yellow highlights indicate task contexts in task-oriented reasoning (S2/S3).
  • Figure 3: More Examples of Three-Stage Grounding across Embodied Tasks: We illustrate more examples across driving, kitchen, and robotic domains that align with our three-stage hierarchy.
  • Figure 4: Examples and Distributions of S1 and S2 Subclasses: We show examples of subclasses for S1 (object without ambiguity, object part, and object with constraints on, e.g., location or color) and S2 (affordance, prediction, safety, contact, and recommendation), along with the percentage of each subclass within its stage.
  • Figure 5: Performance on S1 and S2 for Different VLMs. (Left) Model scores on S1 and S2 tasks. RoboRefer-SFT-8B, MoLMO-7B, Gemini-2.5-Pro, and Qwen-2.5-VL significantly outperform other models. (Right) Average scores across S1 and S2 of different models in four distinct scenarios.
  • ...and 6 more figures