GAT-Grasp: Gesture-Driven Affordance Transfer for Task-Aware Robotic Grasping
Ruixiang Wang, Huayi Zhou, Xinyue Yao, Guiliang Liu, Kui Jia
TL;DR
GAT-Grasp is a gesture-conditioned, retrieval-based framework for task-aware robotic grasping that maps human pointing and grasp gestures to region localization and orientation-constrained grasps without object priors. It builds an Affordance Memory Bank from large-scale human-object interaction (HOI) video data and uses hierarchical retrieval to transfer hand-grasp affordances to target objects, followed by a hand-to-gripper rotation mapping and optional integration with existing grasp generators. The approach achieves robust zero-shot generalization and outperforms language- and vision-based baselines in cluttered real-world scenes, with ablations confirming the critical roles of pointing, visuospatial transfer, and rotation constraints. This work advances intuitive human-robot collaboration by enabling precise, task-specific grasps through non-verbal gestures and scalable affordance transfer.
Abstract
Achieving precise and generalizable grasping across diverse objects and environments is essential for intelligent and collaborative robotic systems. However, existing approaches often struggle with ambiguous affordance reasoning and limited adaptability to unseen objects, leading to suboptimal grasp execution. In this work, we propose GAT-Grasp, a gesture-driven grasping framework that directly uses human hand gestures to guide the generation of task-specific grasp poses with appropriate position and orientation. Specifically, we introduce a retrieval-based affordance transfer paradigm that leverages the implicit correlation between hand gestures and object affordances to extract grasping knowledge from large-scale human-object interaction videos. By eliminating the reliance on predefined object priors, GAT-Grasp enables zero-shot generalization to novel objects and cluttered environments. Real-world evaluations confirm its robustness across diverse and unseen scenarios, demonstrating reliable grasp execution in complex task settings.
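
The retrieval-based affordance transfer described above can be illustrated with a toy sketch. This is not the authors' implementation: the embedding vectors, the two-stage split (coarse match on gesture features, fine match on visual features), the stored yaw angle, and the function names `cosine_sim` and `hierarchical_retrieve` are all assumptions made for illustration. It only shows how a query could select one entry of a memory bank and inherit the grasp orientation stored with it.

```python
import numpy as np

def cosine_sim(query, bank):
    """Cosine similarity between one query vector and each row of bank."""
    q = query / np.linalg.norm(query)
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    return b @ q

def hierarchical_retrieve(gesture_q, visual_q, gesture_bank, visual_bank, k=3):
    """Hypothetical two-stage lookup (an assumption, not the paper's method).

    Coarse stage: keep the k entries whose gesture embedding is closest.
    Fine stage: among those, return the index of the best visual match.
    """
    coarse = np.argsort(-cosine_sim(gesture_q, gesture_bank))[:k]
    fine = cosine_sim(visual_q, visual_bank[coarse])
    return int(coarse[np.argmax(fine)])

# Toy memory bank with four entries (all values are made up).
gesture_bank = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
visual_bank = np.array([[1.0, 0.0, 0.0],
                        [0.0, 1.0, 0.0],
                        [0.0, 0.0, 1.0],
                        [1.0, 1.0, 0.0]])
grasp_yaw_deg = np.array([0.0, 90.0, 45.0, 135.0])  # orientation stored per entry

gesture_query = np.array([1.0, 0.05])
visual_query = np.array([0.1, 1.0, 0.0])

idx = hierarchical_retrieve(gesture_query, visual_query,
                            gesture_bank, visual_bank, k=2)
transferred_yaw = float(grasp_yaw_deg[idx])  # orientation transferred to the target
print(idx, transferred_yaw)
```

In this toy setup the coarse stage narrows the bank to the two entries with the most similar gesture, and the fine stage picks between them using appearance. A real system would replace the hand-written vectors with learned features and would still need the hand-to-gripper rotation mapping that the paper describes.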
