RTAGrasp: Learning Task-Oriented Grasping from Human Videos via Retrieval, Transfer, and Alignment
Wenlong Dong, Dehao Huang, Jiangshan Liu, Chao Tang, Hong Zhang
TL;DR
Task-oriented grasping (TOG) requires identifying both where and how to grasp an object in order to complete a manipulation task. RTAGrasp introduces a Retrieval, Transfer, and Alignment framework that learns complete TOG constraints from human demonstration videos and applies them to robots without task-specific training. It builds a memory of demonstrations, retrieves the most relevant one using semantic and geometric cues, transfers its constraints to the target object with vision foundation models, and then selects an executable grasp through alignment. The method achieves strong performance on seen and unseen object categories on TaskGrasp, surpasses training-based baselines on unseen objects, and is deployed in the real world on a Kinova arm, highlighting data efficiency and robustness to viewpoint changes. Overall, RTAGrasp offers a data-efficient, training-free pathway to task-oriented grasping by leveraging human experience and foundation-model-based cross-object transfer.
Abstract
Task-oriented grasping (TOG) is crucial for robots to accomplish manipulation tasks, requiring the determination of TOG positions and directions. Existing methods either rely on costly manual TOG annotations or only extract coarse grasping positions or regions from human demonstrations, limiting their practicality in real-world applications. To address these limitations, we introduce RTAGrasp, a Retrieval, Transfer, and Alignment framework inspired by human grasping strategies. Specifically, our approach first constructs a robot memory from human grasping demonstration videos with little effort, extracting both TOG position and direction constraints. Then, given a task instruction and a visual observation of the target object, RTAGrasp retrieves the most similar human grasping experience from its memory and leverages the semantic matching capabilities of vision foundation models to transfer the TOG constraints to the target object in a training-free manner. Finally, RTAGrasp aligns the transferred TOG constraints with the robot's action for execution. Evaluations on the public TOG benchmark, the TaskGrasp dataset, show the competitive performance of RTAGrasp on both seen and unseen object categories compared to existing baseline methods. Real-world experiments further validate its effectiveness on a robotic arm. Our code, appendix, and video are available at https://sites.google.com/view/rtagrasp/home.
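The three stages named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: every function name, dictionary key, feature vector, and the cost weight below is hypothetical, plain cosine similarity stands in for the vision-foundation-model matching, and the demonstrations are hand-written rather than extracted from videos. It only shows how retrieval, transfer, and alignment might fit together.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def retrieve(memory, query_embedding):
    """Retrieval: pick the stored demonstration most similar to the query."""
    return max(memory, key=lambda demo: cosine(demo["embedding"], query_embedding))

def transfer(demo, target_features, target_points):
    """Transfer: map the demo's grasp point onto the target point whose
    feature best matches it, and carry the grasp direction over unchanged."""
    scores = [cosine(f, demo["grasp_feature"]) for f in target_features]
    best_index = scores.index(max(scores))
    return target_points[best_index], demo["grasp_direction"]

def align(position, direction, candidates, weight=0.5):
    """Alignment: choose the candidate grasp closest to the transferred
    position whose approach direction best agrees with the transferred one.
    The weight is an assumed trade-off, not a value from the paper."""
    def cost(candidate):
        distance = math.dist(candidate["position"], position)
        misalignment = 1.0 - cosine(candidate["approach"], direction)
        return distance + weight * misalignment
    return min(candidates, key=cost)

# Toy memory: two hand-written "demonstrations" (hypothetical values).
memory = [
    {"task": "pour", "embedding": [1.0, 0.0],
     "grasp_feature": [0.0, 1.0], "grasp_direction": [0.0, 0.0, -1.0]},
    {"task": "cut", "embedding": [0.0, 1.0],
     "grasp_feature": [1.0, 0.0], "grasp_direction": [1.0, 0.0, 0.0]},
]

# Toy target object: per-point features and 3D positions.
target_features = [[1.0, 0.0], [0.1, 0.9]]
target_points = [[0.0, 0.0, 0.0], [0.1, 0.0, 0.2]]

# Toy grasp candidates, standing in for a grasp generator's output.
candidates = [
    {"name": "a", "position": [0.0, 0.0, 0.0], "approach": [0.0, 0.0, -1.0]},
    {"name": "b", "position": [0.1, 0.0, 0.21], "approach": [0.0, 0.0, -1.0]},
    {"name": "c", "position": [0.1, 0.0, 0.2], "approach": [0.0, 0.0, 1.0]},
]

demo = retrieve(memory, query_embedding=[0.9, 0.1])
position, direction = transfer(demo, target_features, target_points)
best = align(position, direction, candidates)
print(demo["task"], position, best["name"])
```

In this toy setup, candidate "c" sits exactly on the transferred position but approaches from the opposite direction, so the alignment step prefers the slightly offset candidate "b". This is the kind of joint position-and-direction constraint the abstract describes.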
