OVAL-Prompt: Open-Vocabulary Affordance Localization for Robot Manipulation through LLM Affordance-Grounding
Edmond Tong, Anthony Opipari, Stanley Lewis, Zhen Zeng, Odest Chadwicke Jenkins
TL;DR
OVAL-Prompt addresses open-vocabulary affordance localization for robot manipulation by pairing a Vision-Language Model (VLM), which segments object parts, with a Large Language Model (LLM), which grounds those parts to affordances through structured prompts, all without domain-specific finetuning. On the UMD dataset, the method achieves localization performance competitive with supervised baselines, supporting its viability for zero-shot affordance grounding on open-set object categories. A key finding is that prompt structure, together with a reprompting step that obtains compatible part names, significantly improves segmentation and grounding accuracy. Real-robot demonstrations further show that the approach can enable manipulation of novel object instances and categories in unstructured environments.
Abstract
For robots to interact with objects effectively, they must understand the form and function of each object they encounter. Essentially, robots need to understand which actions each object affords, and where those affordances can be acted on. Robots are ultimately expected to operate in unstructured human environments, where the set of objects and affordances is not known to the robot before deployment (i.e., the open-vocabulary setting). In this work, we introduce OVAL-Prompt, a prompt-based approach for open-vocabulary affordance localization in RGB-D images. By leveraging a Vision Language Model (VLM) for open-vocabulary object part segmentation and a Large Language Model (LLM) to ground each part segment to an affordance, OVAL-Prompt demonstrates generalizability to novel object instances, categories, and affordances without domain-specific finetuning. Quantitative experiments demonstrate that, without any finetuning, OVAL-Prompt achieves localization accuracy that is competitive with supervised baseline models. Moreover, qualitative experiments show that OVAL-Prompt enables affordance-based robot manipulation of open-vocabulary object instances and categories. Project Page: https://ekjt.github.io/OVAL-Prompt/
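
The pipeline summarized above (part segmentation by a VLM, affordance grounding by an LLM, and reprompting for usable part names) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `propose_parts` and `localize`, the prompt wording, the order of the steps, and the `max_reprompts` parameter are all assumptions made for illustration. The LLM and the segmenter are passed in as plain callables, so no particular model or library API is implied.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical interfaces (assumptions, not from the paper):
#   LLM:       prompt text -> response text
#   Segmenter: list of part names -> {part name: mask} for the parts it found
LLM = Callable[[str], str]
Segmenter = Callable[[List[str]], Dict[str, object]]


def propose_parts(llm: LLM, obj: str, rejected: List[str]) -> List[str]:
    """Ask the LLM for part names, steering away from names that failed before."""
    prompt = f"List the parts of a {obj}, comma-separated."
    if rejected:
        prompt += " Avoid these names: " + ", ".join(rejected) + "."
    return [p.strip().lower() for p in llm(prompt).split(",") if p.strip()]


def localize(
    obj: str,
    affordance: str,
    llm: LLM,
    segment: Segmenter,
    max_reprompts: int = 2,
) -> Optional[Tuple[str, object]]:
    """Return (part name, mask) for the part affording the action, or None."""
    rejected: List[str] = []
    masks: Dict[str, object] = {}
    # Reprompt until the segmenter recognizes at least one proposed part name.
    for _ in range(max_reprompts + 1):
        names = propose_parts(llm, obj, rejected)
        masks = segment(names)
        if masks:
            break
        rejected.extend(names)
    if not masks:
        return None
    # Ground the affordance to one of the parts that were actually segmented.
    options = ", ".join(sorted(masks))
    choice = llm(
        f"Which part of a {obj} affords '{affordance}'? "
        f"Options: {options}. Answer with one option."
    ).strip().lower()
    if choice not in masks:
        return None
    return choice, masks[choice]
```

In this sketch the reprompting loop only retries when the segmenter returns nothing at all; a fuller version might retry per part or validate the LLM's final choice more leniently, and those choices are left open here.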
