RTAGrasp: Learning Task-Oriented Grasping from Human Videos via Retrieval, Transfer, and Alignment

Wenlong Dong, Dehao Huang, Jiangshan Liu, Chao Tang, Hong Zhang

TL;DR

Task-oriented grasping (TOG) requires identifying both where and how to grasp an object so that the subsequent manipulation task can be completed. RTAGrasp introduces a Retrieval, Transfer, and Alignment framework that learns complete TOG constraints (position and direction) from human demonstration videos and applies them to robots without task-specific training. It builds a memory of demonstrations, retrieves from it using semantic and geometric cues, transfers the retrieved constraints to the target object with vision foundation models, and then selects an executable grasp aligned with the transferred constraints. The method performs competitively on both seen and unseen object categories on TaskGrasp, surpasses training-based baselines on unseen objects, and is deployed in the real world on a Kinova arm, highlighting its data efficiency and robustness to viewpoint changes. Overall, RTAGrasp offers a data-efficient, training-free route to task-oriented grasping that draws on human experience and foundation-model-based cross-object transfer.

Abstract

Task-oriented grasping (TOG) is crucial for robots to accomplish manipulation tasks, requiring the determination of TOG positions and directions. Existing methods either rely on costly manual TOG annotations or only extract coarse grasping positions or regions from human demonstrations, limiting their practicality in real-world applications. To address these limitations, we introduce RTAGrasp, a Retrieval, Transfer, and Alignment framework inspired by human grasping strategies. Specifically, our approach first effortlessly constructs a robot memory from human grasping demonstration videos, extracting both TOG position and direction constraints. Then, given a task instruction and a visual observation of the target object, RTAGrasp retrieves the most similar human grasping experience from its memory and leverages the semantic matching capabilities of vision foundation models to transfer the TOG constraints to the target object in a training-free manner. Finally, RTAGrasp aligns the transferred TOG constraints with the robot's action for execution. Evaluations on the TaskGrasp dataset, a public TOG benchmark, show that RTAGrasp performs competitively with existing baseline methods on both seen and unseen object categories. Real-world experiments further validate its effectiveness on a robotic arm. Our code, appendix, and video are available at https://sites.google.com/view/rtagrasp/home.
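
The final alignment step can be illustrated with a small sketch. The abstract does not say how the transferred constraints are matched to robot grasps, so everything below is an assumption made for illustration: the function name `select_grasp`, the distance threshold `max_dist`, and the two-stage rule (keep candidates near the transferred position $\mathbf{p}_B$, then pick the one whose approach direction best matches $\mathbf{v}_B$) are hypothetical and not the authors' implementation.

```python
import numpy as np


def select_grasp(cand_pos, cand_dir, p_b, v_b, max_dist=0.05):
    """Pick one grasp candidate consistent with transferred TOG constraints.

    Hypothetical rule (not from the paper): keep candidates whose position
    lies within `max_dist` of p_b, then return the index of the one whose
    approach direction has the highest cosine similarity to v_b.
    Returns None when no candidate is close enough to p_b.
    """
    pos = np.asarray(cand_pos, dtype=float)   # (N, 3) candidate positions
    dirs = np.asarray(cand_dir, dtype=float)  # (N, 3) approach directions
    p_b = np.asarray(p_b, dtype=float)
    v_b = np.asarray(v_b, dtype=float)

    # Normalize directions so the dot product is a cosine similarity.
    v_b = v_b / np.linalg.norm(v_b)
    dirs = dirs / np.linalg.norm(dirs, axis=1, keepdims=True)

    # Position constraint: distance from each candidate to p_b.
    near = np.linalg.norm(pos - p_b, axis=1) <= max_dist
    if not near.any():
        return None

    # Direction constraint: best-aligned candidate among the nearby ones.
    cos = dirs @ v_b
    cos[~near] = -np.inf
    return int(np.argmax(cos))


# Usage with made-up candidates: two lie near p_b, one is far away.
positions = [[0.0, 0.0, 0.0], [0.01, 0.0, 0.0], [1.0, 0.0, 0.0]]
directions = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
best = select_grasp(positions, directions, p_b=[0.0, 0.0, 0.0], v_b=[0.0, 0.0, 1.0])
```

Filtering on position before ranking on direction is one simple design choice among several; a weighted score over both terms would be an equally plausible reading of "alignment".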

Paper Structure (15 sections, 2 equations, 5 figures, 5 tables)

This paper contains 15 sections, 2 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: (a) Robots learn TOG skills from human demonstration videos through Retrieval, Transfer, and Alignment. (b) An incompatible TOG position or direction could result in the failure to complete subsequent tasks.
  • Figure 2: Overview: the pipeline first utilizes (a) a retrieval module to retrieve the optimal grasping experience (i.e., TOG constraints) from the memory. Next, it uses (b) a transfer module to transfer the retrieved TOG constraints to the target object to obtain the TOG position $\mathbf{p}_{B}$ and the TOG direction $\mathbf{v}_{B}$. Finally, (c) an alignment module aligns the transferred TOG constraints to the robot's action for execution.
  • Figure 3: Each instance in the robot memory consists of an object-centered RGB image, a TOG position $\mathbf{p}_A$, a TOG direction $\mathbf{v}_A$, and a task instruction.
  • Figure 4: Qualitative results of TOG. Each row is a visualization of the intermediate results for an object in an experimental scene.
  • Figure 5: Performance with different amounts of data available for retrieval.
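
The memory layout described in the Figure 3 caption, together with the retrieval step in Figure 2(a), can be sketched as a small data structure. This is a minimal sketch under stated assumptions: the class name `MemoryInstance`, the `embedding` field, and the cosine-similarity ranking in `retrieve` are hypothetical. The captions only state that each instance holds an object-centered RGB image, a TOG position $\mathbf{p}_A$, a TOG direction $\mathbf{v}_A$, and a task instruction; they do not specify the coordinate frame of $\mathbf{p}_A$ and $\mathbf{v}_A$ or the features used for retrieval.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class MemoryInstance:
    image: np.ndarray      # object-centered RGB image, shape (H, W, 3)
    p_a: np.ndarray        # TOG position p_A (coordinate frame is an assumption)
    v_a: np.ndarray        # TOG direction v_A (coordinate frame is an assumption)
    instruction: str       # task instruction, e.g. "pour"
    embedding: np.ndarray  # hypothetical precomputed feature used for retrieval


def retrieve(memory, query_embedding):
    """Return the stored instance most similar to the query.

    Assumption: similarity is the cosine between the query feature and each
    stored feature. The actual retrieval criteria are not given here.
    """
    q = np.asarray(query_embedding, dtype=float)
    q = q / np.linalg.norm(q)
    sims = [float(m.embedding @ q) / float(np.linalg.norm(m.embedding)) for m in memory]
    return memory[int(np.argmax(sims))]


# Usage with placeholder data: two demonstrations and one query feature.
blank = np.zeros((4, 4, 3), dtype=np.uint8)
memory = [
    MemoryInstance(blank, np.array([1.0, 2.0]), np.array([0.0, 1.0]), "pour",
                   np.array([1.0, 0.0, 0.0])),
    MemoryInstance(blank, np.array([3.0, 1.0]), np.array([1.0, 0.0]), "cut",
                   np.array([0.0, 1.0, 0.0])),
]
match = retrieve(memory, np.array([0.1, 0.9, 0.0]))
```

The retrieved instance supplies the $\mathbf{p}_A$ and $\mathbf{v}_A$ that the transfer module then maps onto the target object.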