FoundationGrasp: Generalizable Task-Oriented Grasping with Foundation Models

Chao Tang, Dehao Huang, Wenlong Dong, Ruinian Xu, Hong Zhang

TL;DR

FoundationGrasp is a foundation model-based task-oriented grasping (TOG) framework that leverages the open-ended knowledge of foundation models to improve the generalization of existing TOG methods. It applies broadly to scenarios involving tool manipulation.

Abstract

Task-oriented grasping (TOG), which refers to synthesizing grasps on an object that are configurationally compatible with the downstream manipulation task, is the first milestone towards tool manipulation. Analogous to the activation of two brain regions responsible for semantic and geometric reasoning during cognitive processes, modeling the intricate relationship between objects, tasks, and grasps necessitates rich semantic and geometric prior knowledge about these elements. Existing methods typically restrict the prior knowledge to a closed-set scope, limiting their generalization to novel objects and tasks outside the training set. To address this limitation, we propose FoundationGrasp, a foundation model-based TOG framework that leverages the open-ended knowledge from foundation models to learn generalizable TOG skills. Extensive experiments are conducted on the contributed Language and Vision Augmented TaskGrasp (LaViA-TaskGrasp) dataset, demonstrating the superiority of FoundationGrasp over existing methods when generalizing to novel object instances, object classes, and tasks outside the training set. Furthermore, the effectiveness of FoundationGrasp is validated in real-robot grasping and manipulation experiments on a 7-DoF robotic arm. Our code, data, appendix, and video are publicly available at https://sites.google.com/view/foundationgrasp.

Paper Structure (22 sections, 12 equations, 12 figures, 11 tables)

Figures (12)

  • Figure 1: An overview of the FoundationGrasp framework. The pipeline consists of (1) knowledge generation, (2) multi-modal feature representation, and (3) task-oriented grasp evaluation. When presented with a language instruction, FoundationGrasp first prompts an LLM to generate semantic and geometric descriptions of the object and the task. A web-based image retrieval module crowdsources images from the Internet. Subsequently, an LLM-based semantic knowledge encoder and a VLM-based geometric knowledge encoder transform these descriptions and images, together with multi-modal sensory inputs, into latent-space feature representations. In the final stage, a Transformer-based task-oriented grasp evaluator with semantic and geometric branches evaluates the task compatibility of each grasp candidate.
  • Figure 2: A grasp $g$ is represented with six control points $X_g$ on the gripper model in the object reference frame. The origin of the object frame is $\overline{X}$, the center of mass of the object point cloud $X_o$.
  • Figure 3: The task-oriented grasp evaluator is a customized Transformer consisting of a geometric branch $TGE_{geo}$ (left) and a semantic branch $TGE_{sem}$ (right).
  • Figure 4: Novel objects tested in real-robot experiments cover commonly used accessories, kitchen utensils, and mechanic tools.
  • Figure 5: Quantitative results of perception experiments
  • ...and 7 more figures
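
The grasp representation described in the Figure 2 caption can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that the center of mass $\overline{X}$ is the unweighted mean of the object point cloud $X_o$, that the six control points $X_g$ start out in the same frame as the point cloud, and that the object frame differs from that frame only by a translation. The function names and all coordinates below are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def centroid(points: Sequence[Point]) -> Point:
    """Unweighted mean of the points.

    Assumption: this stands in for the center of mass of the object
    point cloud, i.e. every point carries equal weight.
    """
    if not points:
        raise ValueError("point cloud is empty")
    n = len(points)
    return (
        sum(p[0] for p in points) / n,
        sum(p[1] for p in points) / n,
        sum(p[2] for p in points) / n,
    )


def to_object_frame(points: Sequence[Point], origin: Point) -> List[Point]:
    """Translate points so that `origin` maps to (0, 0, 0)."""
    ox, oy, oz = origin
    return [(x - ox, y - oy, z - oz) for x, y, z in points]


def represent_grasp(
    object_cloud: Sequence[Point], control_points: Sequence[Point]
) -> Tuple[List[Point], List[Point]]:
    """Express the object cloud and the grasp control points in the object frame.

    Assumption: the object frame keeps the axes of the input frame and
    only moves the origin to the centroid of the object cloud.
    """
    if len(control_points) != 6:
        raise ValueError("a grasp is represented with six control points")
    origin = centroid(object_cloud)
    return (
        to_object_frame(object_cloud, origin),
        to_object_frame(control_points, origin),
    )


# Hypothetical data: a four-point "object" and six gripper control points.
cloud = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0), (2.0, 2.0, 4.0)]
ctrl = [
    (1.0, 1.0, 5.0),
    (1.0, 1.5, 4.0),
    (1.0, 0.5, 4.0),
    (1.0, 1.5, 3.0),
    (1.0, 0.5, 3.0),
    (1.0, 1.0, 4.0),
]
X_o, X_g = represent_grasp(cloud, ctrl)
```

After the call, the centered cloud `X_o` has its centroid at the origin, and each control point in `X_g` is expressed relative to the object rather than to the sensor, which makes the representation independent of where the object sits in the scene.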