Point What You Mean: Visually Grounded Instruction Policy

Hang Yu, Juntu Zhao, Yufeng Liu, Kaiyu Li, Cheng Ma, Di Zhang, Yingdong Hu, Guang Chen, Junyuan Xie, Junliang Guo, Junqiao Zhao, Yang Gao

TL;DR

Point-VLA integrates explicit pixel-level grounding into Vision-Language-Action policies by overlaying bounding boxes as visual prompts, resolving referential ambiguity in cluttered and unseen scenes. A semi-automatic data-annotation pipeline using multi-modal LLMs enables scalable grounding supervision, while co-training with text-only data preserves traditional instruction-following ability. Across six real-world tasks, Point-VLA substantially outperforms text-only baselines and demonstrates strong generalization to unseen objects and configurations, including robustness to spatial perturbations and diverse embodiments. The work offers a practical, plug-and-play grounding interface with scalable data augmentation and interactive inference, advancing robust, grounding-aware embodied control.

Abstract

Vision-Language-Action (VLA) models align vision and language with embodied control, but their object-referring ability remains limited when they rely solely on text prompts, especially in cluttered or out-of-distribution (OOD) scenes. In this study, we introduce Point-VLA, a plug-and-play policy that augments language instructions with explicit visual cues (e.g., bounding boxes) to resolve referential ambiguity and enable precise object-level grounding. To efficiently scale visually grounded datasets, we further develop an automatic data annotation pipeline requiring minimal human effort. We evaluate Point-VLA on diverse real-world referring tasks and observe consistently stronger performance than text-only instruction VLAs, particularly in cluttered or unseen-object scenarios, with robust generalization. These results demonstrate that Point-VLA effectively resolves object-referring ambiguity through pixel-level visual grounding, achieving more generalizable embodied control.

Paper Structure

This paper contains 45 sections, 4 equations, 13 figures, 4 tables.

Figures (13)

  • Figure 1: We introduce Point-VLA, which resolves the inherent limitations of text-only instructions in precise target referring, e.g., referring to objects in clutter, handling unseen OOD objects, or placing on a plain tabletop without a reference point. By overlaying bounding boxes on images, Point-VLA provides explicit pixel-level cues that enable accurate and unambiguous referring in real-world manipulation.
  • Figure 2: Point-VLA resolves linguistically inexpressible references through explicit visual grounding. In scenes with many visually similar objects, even complex and fully specified textual descriptions cannot generalize reliably, causing ambiguous and incorrect actions.
  • Figure 3: We obtain the visual prompt by drawing a bounding box on the first frame, either annotated automatically or manually. This grounded frame is then lightly augmented (CutMix, translation) and paired with every robot observation in the episode. The model is trained using both the current observation and the fixed grounded first-frame prompt, enabling consistent pixel-level target grounding throughout the trajectory.
  • Figure 4: Overview of our tasks and robot embodiments. (a) We hire professional operators to collect real-world robot demonstration data. (b) Two robot embodiments used for evaluation: a fixed dual-arm robot and a full-body humanoid robot. (c) Representative tasks, including picking irregular objects, picking OOD objects, picking in clutter, precise picking in dense trays, placing on a plain tabletop without reference points, and precise placing. These tasks contain targets that cannot be precisely referred to using text alone.
  • Figure 5: Success rates (%) on three spatial referring tasks under three instruction modes: the Text VLA baseline, Point-VLA($l$) with text-only instructions, and Point-VLA(VGI) with visually grounded instructions (text for high-level action, bounding box for spatial reference). Point-VLA($l$) matches or exceeds the baseline, and Point-VLA(VGI) achieves the highest success rates on complex spatial references.
  • ...and 8 more figures
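
The grounding step described in the Figure 3 caption (draw a bounding box on the first frame, lightly augment it, and pair it with every observation in the episode) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `draw_box`, `translate`, and `build_training_pairs`, the box format `(x0, y0, x1, y1)`, the shift range, and the choice to sample a fresh shift per pair are all assumptions made for illustration, and the CutMix augmentation mentioned in the caption is omitted.

```python
import numpy as np

def draw_box(frame, box, color=(255, 0, 0), thickness=2):
    """Return a copy of an (H, W, 3) uint8 frame with a rectangle outline.

    `box` is (x0, y0, x1, y1) in pixels, with x1/y1 exclusive (assumed format).
    """
    out = frame.copy()
    x0, y0, x1, y1 = box
    t = thickness
    out[y0:y0 + t, x0:x1] = color   # top edge
    out[y1 - t:y1, x0:x1] = color   # bottom edge
    out[y0:y1, x0:x0 + t] = color   # left edge
    out[y0:y1, x1 - t:x1] = color   # right edge
    return out

def translate(frame, dx, dy):
    """Shift the image by (dx, dy) pixels; exposed pixels are filled with zeros."""
    h, w = frame.shape[:2]
    if abs(dx) >= w or abs(dy) >= h:
        raise ValueError("shift must be smaller than the image size")
    out = np.zeros_like(frame)
    sx0, sx1 = max(0, -dx), min(w, w - dx)   # source columns that stay in view
    sy0, sy1 = max(0, -dy), min(h, h - dy)   # source rows that stay in view
    dx0, dy0 = max(0, dx), max(0, dy)        # top-left corner in the output
    out[dy0:dy0 + (sy1 - sy0), dx0:dx0 + (sx1 - sx0)] = frame[sy0:sy1, sx0:sx1]
    return out

def build_training_pairs(episode, box, rng, max_shift=4):
    """Pair every observation with an augmented copy of the grounded first frame.

    Sampling a new random shift per pair is an assumption of this sketch.
    """
    grounded = draw_box(episode[0], box)
    pairs = []
    for obs in episode:
        dx = int(rng.integers(-max_shift, max_shift + 1))
        dy = int(rng.integers(-max_shift, max_shift + 1))
        pairs.append((obs, translate(grounded, dx, dy)))
    return pairs
```

In this sketch each training sample is a tuple of the current observation and the grounded first-frame prompt, so the target reference stays fixed while the observation changes over the trajectory.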