Point What You Mean: Visually Grounded Instruction Policy

Hang Yu; Juntu Zhao; Yufeng Liu; Kaiyu Li; Cheng Ma; Di Zhang; Yingdong Hu; Guang Chen; Junyuan Xie; Junliang Guo; Junqiao Zhao; Yang Gao

TL;DR

Point-VLA integrates explicit pixel-level grounding into Vision-Language-Action policies by overlaying bounding boxes as visual prompts, resolving referential ambiguity in cluttered and unseen scenes. A semi-automatic data-annotation pipeline using multi-modal LLMs enables scalable grounding supervision, while co-training with text-only data preserves traditional instruction-following ability. Across six real-world tasks, Point-VLA substantially outperforms text-only baselines and demonstrates strong generalization to unseen objects and configurations, including robustness to spatial perturbations and diverse embodiments. The work offers a practical, plug-and-play grounding interface with scalable data augmentation and interactive inference, advancing robust, grounding-aware embodied control.

Abstract

Vision-Language-Action (VLA) models align vision and language with embodied control, but their object-referring ability remains limited when they rely solely on text prompts, especially in cluttered or out-of-distribution (OOD) scenes. In this study, we introduce Point-VLA, a plug-and-play policy that augments language instructions with explicit visual cues (e.g., bounding boxes) to resolve referential ambiguity and enable precise object-level grounding. To efficiently scale visually grounded datasets, we further develop an automatic data annotation pipeline requiring minimal human effort. We evaluate Point-VLA on diverse real-world referring tasks and observe consistently stronger performance than text-only instruction VLAs, particularly in cluttered or unseen-object scenarios, with robust generalization. These results demonstrate that Point-VLA effectively resolves object-referring ambiguity through pixel-level visual grounding, achieving more generalizable embodied control.
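As a rough illustration of what "augmenting an instruction with a bounding box" could look like at the pixel level, the sketch below draws a rectangle outline onto a camera frame before it would be passed to a policy. The abstract does not specify how Point-VLA renders its visual cues, so the function name `overlay_box`, the color, the line thickness, and the nested-list image representation are all assumptions made for this example only.

```python
def overlay_box(image, box, color=(255, 0, 0), thickness=2):
    """Return a copy of `image` with a rectangle outline drawn on it.

    `image` is a list of rows, each row a list of (r, g, b) tuples.
    `box` is (x_min, y_min, x_max, y_max) in inclusive pixel coordinates.
    All names and defaults here are hypothetical, not taken from the paper.
    """
    height, width = len(image), len(image[0])
    x0, y0, x1, y1 = box
    # Clamp the box to the image so out-of-range coordinates cannot fail.
    x0, x1 = max(0, min(x0, width - 1)), max(0, min(x1, width - 1))
    y0, y1 = max(0, min(y0, height - 1)), max(0, min(y1, height - 1))

    # Copy each row so the original frame is left untouched.
    out = [list(row) for row in image]
    for y in range(y0, y1 + 1):
        for x in range(x0, x1 + 1):
            # A pixel belongs to the outline if it lies within `thickness`
            # pixels of any edge of the box.
            on_border = (
                y - y0 < thickness or y1 - y < thickness
                or x - x0 < thickness or x1 - x < thickness
            )
            if on_border:
                out[y][x] = color
    return out


# A blank 64x64 frame stands in for a real camera image.
frame = [[(0, 0, 0)] * 64 for _ in range(64)]
prompted = overlay_box(frame, (10, 20, 40, 50))
```

Drawing only the outline leaves the object's own pixels visible, which seems a natural fit for a cue meant to indicate a target rather than mask it; whether the authors make the same choice is not stated here.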
