GLOVER: Generalizable Open-Vocabulary Affordance Reasoning for Task-Oriented Grasping

Teli Ma, Zifan Wang, Jiaming Zhou, Mengmeng Wang, Junwei Liang

TL;DR

GLOVER tackles open-vocabulary robotic grasping by combining fine-tuned language-vision models with a continuous affordance reasoning framework. It replaces costly 3D radiance rendering and offline affordance memory with an end-to-end approach that outputs visual affordance masks conditioned on language and image inputs, then uses a non-parametric AGE module to align grasp poses with recovered affordance geometry via superquadrics. A VL-Affordance dataset (over 12k images with 52k interactions) supports multimodal fine-tuning, enabling the model to inherit world knowledge from LLMs while learning to infer fine-grained graspable parts. Experimental results across 30 real-world scenes and humanoid embodiments show substantial gains in affordance reasoning and grasp success, with notable speedups in both reasoning (≈29x) and pose estimation (≈40x) over prior methods. The approach offers practical impact for real-time, open-vocabulary manipulation in diverse environments and embodiments.

Abstract

Inferring affordable (i.e., graspable) parts of arbitrary objects based on human specifications is essential for robots advancing toward open-vocabulary manipulation. Current grasp planners, however, are hindered by limited vision-language comprehension and time-consuming 3D radiance modeling, restricting real-time, open-vocabulary interactions with objects. To address these limitations, we propose GLOVER, a unified Generalizable Open-Vocabulary Affordance Reasoning framework, which fine-tunes Large Language Models (LLMs) to predict the visual affordance of graspable object parts within RGB feature space. We compile a dataset of over 10,000 images from human-object interactions, annotated with unified visual and linguistic affordance labels, to enable multi-modal fine-tuning. GLOVER inherits world knowledge and common-sense reasoning from LLMs, facilitating more fine-grained object understanding and sophisticated tool-use reasoning. To enable effective real-world deployment, we present Affordance-Aware Grasping Estimation (AGE), a non-parametric grasp planner that aligns the gripper pose with a superquadric surface derived from affordance data. In evaluations across 30 table-top real-world scenes, GLOVER achieves success rates of 86.0% in part identification and 76.3% in grasping, and runs approximately 29 times faster in affordance reasoning and 40 times faster in grasp pose estimation than the previous state of the art. We also validate generalization across embodiments, demonstrating effectiveness on humanoid robots with dexterous hands.
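
The superquadric surface that AGE aligns the gripper with can be illustrated with the standard superquadric inside-outside function. The sketch below is only an illustration of that textbook formula, not the paper's implementation: the function name `superquadric_f`, the parameter names `scale`, `eps1`, `eps2`, and the assumption that the point is already expressed in the superquadric's own frame are all hypothetical choices made here.

```python
def superquadric_f(point, scale, eps1, eps2):
    """Standard superquadric inside-outside function (illustrative sketch).

    point : (x, y, z) in the superquadric's local frame (assumed here).
    scale : (a1, a2, a3) semi-axis lengths.
    eps1, eps2 : shape exponents; eps1 = eps2 = 1 gives an ellipsoid.

    Returns a value < 1 inside the surface, 1 on it, and > 1 outside.
    """
    # Normalize each coordinate by its semi-axis; abs() keeps the
    # fractional powers real for negative coordinates.
    x, y, z = (abs(c) / s for c, s in zip(point, scale))
    xy_term = (x ** (2.0 / eps2) + y ** (2.0 / eps2)) ** (eps2 / eps1)
    return xy_term + z ** (2.0 / eps1)


# Usage: an ellipsoid with semi-axes (2, 1, 1).
ellipsoid = dict(scale=(2.0, 1.0, 1.0), eps1=1.0, eps2=1.0)
on_surface = superquadric_f((2.0, 0.0, 0.0), **ellipsoid)
inside = superquadric_f((0.5, 0.0, 0.0), **ellipsoid)
outside = superquadric_f((0.0, 0.0, 3.0), **ellipsoid)
```

With both exponents set to 1 the function reduces to the ellipsoid equation, which is why a gripper modeled as an ellipsoid surface can be treated in the same family of shapes as the recovered affordance superquadric.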

Paper Structure

This paper contains 25 sections, 13 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: An overview of our method (our contributions are highlighted with colored numbers). 1. We annotate the categories with the VLM and unify the affordance representation. 2. We fine-tune the affordance decoder to decode the affordance token [AFF], which encodes multi-modal information from the multi-modal LLM. The fine-tuned GLOVER infers visual affordance in an open-vocabulary manner. 3. The affordance-aware grasping estimation (AGE) module: (a) voxel down-sample the point clouds; (b) filter the noise with DBSCAN schubert2017dbscan clustering; (c) recover the superquadric $\mathcal{A}$ from the filtered stereo affordance; (d) represent the gripper similarly as an ellipsoid surface $\mathcal{G}$; (e) estimate the grasp pose by aligning $\mathcal{A}$ and $\mathcal{G}$.
  • Figure 2: Evaluation examples: inferred visual affordance and grasping poses in multiple scenes. The test scenes are designed to evaluate the model's compositional object understanding (attributes, relations, complex scenes) and task-aware tool use (tool use, function reasoning). Generalization: GLOVER shows open-vocabulary ability across diverse environments, including real-world scenes, a simulator (RLBench james2020rlbench), scenes from other datasets (Ego4D grauman2022ego4d), and across embodiments (humanoid robots with dexterous hands).
  • Figure 3: The experimental setup with a single table-top robotic arm.
  • Figure 4: The experimental setup with a humanoid robot with dexterous hands.
  • Figure 5: Failure cases. Rows 1 & 2: Rough grasping motions lead to collisions and overturns. Row 3: The cumulative error in hand-eye calibration leads to inaccurate positions.
  • ...and 2 more figures
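
Steps (a) and (b) of the AGE module in Figure 1 (voxel down-sampling followed by DBSCAN noise filtering) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the `voxel_size`, `eps`, and `min_samples` parameters, the brute-force O(n²) neighbor search, and the choice to keep only the largest cluster are all hypothetical and are not specified in this section.

```python
from collections import defaultdict, deque
from math import dist, floor


def voxel_downsample(points, voxel_size):
    """Replace all points that share a cubic voxel with their centroid."""
    buckets = defaultdict(list)
    for p in points:
        key = tuple(floor(c / voxel_size) for c in p)
        buckets[key].append(p)
    return [
        tuple(sum(axis) / len(group) for axis in zip(*group))
        for group in buckets.values()
    ]


def dbscan(points, eps, min_samples):
    """Brute-force DBSCAN. Returns one label per point; -1 marks noise."""
    n = len(points)
    # Each point counts as its own neighbor (distance 0 <= eps).
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = -1  # provisional noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = deque(neighbors[i])
        while queue:
            j = queue.popleft()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                queue.extend(neighbors[j])  # j is a core point: expand
    return labels


def keep_largest_cluster(points, eps, min_samples):
    """Assumed filtering rule: drop noise and keep the biggest cluster."""
    labels = dbscan(points, eps, min_samples)
    clustered = [label for label in labels if label != -1]
    if not clustered:
        return []
    largest = max(set(clustered), key=clustered.count)
    return [p for p, label in zip(points, labels) if label == largest]


# Usage: four nearby affordance points plus one far-away outlier.
cloud = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (0.0, 0.1, 0.0),
         (0.1, 0.1, 0.0), (5.0, 5.0, 5.0)]
filtered = keep_largest_cluster(cloud, eps=0.2, min_samples=3)
```

In practice a point-cloud library with spatial indexing would replace the quadratic neighbor search; the pure-Python version is used here only to keep the sketch self-contained.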