OVAL-Grasp: Open-Vocabulary Affordance Localization for Task Oriented Grasping

Edmond Tong, Advaith Balaji, Anthony Opipari, Stanley Lewis, Zhen Zeng, Odest Chadwicke Jenkins

TL;DR

OVAL-Grasp addresses zero-shot task-oriented grasping by grounding language-described affordances to object parts with a modular LLM–VLM pipeline. An LLM decomposes the object into desirable and undesirable parts, a VLM segments those parts, the segments are combined into an affordance heatmap, and the heatmap scores grasp proposals from a geometry-based generator to select the final grasp $g \in SE(3)$. In experiments on 20 objects with 3 tasks per object, OVAL-Grasp achieves a part selection success rate of 95.0% and a grasp success rate of 78.3%, outperforming GraspGPT and ShapeGrasp, and its ablations demonstrate robustness under occlusion and clutter. The modular design can scale as foundation models improve, but real-time closed-loop control remains future work.
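The heatmap-and-scoring step summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `build_heatmap` and `select_grasp`, the +1/-1 weights for desirable and undesirable parts, and the reduction of each grasp candidate to a single projected image pixel are all assumptions made for illustration. The LLM and VLM stages are replaced by hand-written boolean masks.

```python
import numpy as np

def build_heatmap(shape, desirable_masks, undesirable_masks):
    """Combine boolean part masks into a signed per-pixel score map.

    Assumption: desirable parts add +1 and undesirable parts add -1;
    the weighting used in the paper may differ.
    """
    heatmap = np.zeros(shape, dtype=float)
    for mask in desirable_masks:
        heatmap[mask] += 1.0
    for mask in undesirable_masks:
        heatmap[mask] -= 1.0
    return heatmap

def select_grasp(heatmap, grasp_pixels):
    """Pick the grasp candidate whose projected pixel scores highest.

    grasp_pixels holds hypothetical (u, v) image coordinates, one per
    candidate. Returns the candidate index, or None when no candidate
    lands on a positively scored region.
    """
    if not grasp_pixels:
        return None
    scores = [heatmap[v, u] for (u, v) in grasp_pixels]
    best = int(np.argmax(scores))
    return best if scores[best] > 0 else None

# Toy 4x6 image: "handle" on the left (grasp), "blade" on the right (avoid).
handle = np.zeros((4, 6), dtype=bool)
handle[:, 0:2] = True
blade = np.zeros((4, 6), dtype=bool)
blade[:, 3:6] = True

heatmap = build_heatmap((4, 6), [handle], [blade])
candidates = [(4, 1), (1, 2), (2, 0)]  # on blade, on handle, on neither
chosen = select_grasp(heatmap, candidates)
```

In this toy scene the candidate on the handle is the only one with a positive score, so it is the one selected; a candidate set with no positive score yields `None` rather than a grasp on an undesirable part.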

Abstract

To manipulate objects in novel, unstructured environments, robots need task-oriented grasps that target object parts based on the given task. Geometry-based methods often struggle with visually defined parts, occlusions, and unseen objects. We introduce OVAL-Grasp, a zero-shot, open-vocabulary approach to task-oriented, affordance-based grasping that uses large language models and vision-language models to allow a robot to grasp objects at the correct part according to a given task. Given an RGB image and a task, OVAL-Grasp identifies parts to grasp or avoid with an LLM, segments them with a VLM, and generates a 2D heatmap of actionable regions on the object. In our evaluations, our method outperformed two task-oriented grasping baselines in experiments with 20 household objects and 3 unique tasks for each object. OVAL-Grasp successfully identifies and segments the correct object part 95% of the time and grasps the correct actionable area 78.3% of the time in real-world experiments with the Fetch mobile manipulator. Additionally, OVAL-Grasp finds correct object parts under partial occlusions, demonstrating a part selection success rate of 80% in cluttered scenes. We also demonstrate OVAL-Grasp's efficacy in scenarios that rely on visual features for part selection, and show the benefit of a modular design through our ablation experiments. Our project webpage is available at https://ekjt.github.io/OVAL-Grasp/

Paper Structure

This paper contains 16 sections, 1 equation, 7 figures, 3 tables.

Figures (7)

  • Figure 1: OVAL-Grasp at work on the Fetch mobile manipulator. The robot understands which parts of the object it should grasp and which parts should be avoided to fulfill the given tasks described by language.
  • Figure 2: System overview. The robot generates task-oriented grasps by using an LLM to identify grasp-relevant object parts, a VLM to segment them, and a constructed heatmap to filter grasp candidates, producing a set of grasps that fulfill the given task.
  • Figure 3: Our experimental setup used the Fetch robot (left) and household and YCB objects (right) to evaluate OVAL-Grasp and baseline methods.
  • Figure 4: Examples of ShapeGrasp failures. When dealing with more convex geometries and parts that are flush with the object, ShapeGrasp fails to identify the part.
  • Figure 5: OVAL-Grasp identifies object parts not linked to the object's geometry. Scores are assigned to the barcode and soup can label segments in the heatmap, and grasps that obstruct them are filtered out.
  • ...and 2 more figures