Leveraging Semantic and Geometric Information for Zero-Shot Robot-to-Human Handover
Jiangshan Liu, Wenlong Dong, Jiankun Wang, Max Q.-H. Meng
TL;DR
This work tackles zero-shot robot-to-human handover by fusing semantic grounding from vision-language models with geometric constraints to determine optimal handover grasps. It introduces three modules: region grounding using Set-of-Mark prompted VLMs and SAM-based segmentation to define human and robot grasp regions; a grasp selection mechanism that generates diverse candidates with Contact-GraspNet and ranks them by distance and approach angle relative to the human region; and an execution module that optimizes handover pose for ergonomic human interaction using a two-joint planar arm model. Ablation studies show that both semantic region grounding and geometric ranking improve handover success, while real-world experiments and a user study demonstrate higher success rates and more user-preferred handovers compared to baselines like AffNet-DR and LAN-grasp. The approach advances zero-shot manipulation in HRI by leveraging foundation-model grounding with geometry-aware constraints to enhance generalization across objects and improve practical usability in human environments.
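The two-joint planar arm model mentioned above can be illustrated with the standard two-link kinematics below. This is a minimal sketch, not the authors' implementation: the link lengths, the shoulder-centered frame, the elbow-down convention, and the function names `forward_kinematics` and `inverse_kinematics` are all assumptions made for illustration, and the summary does not state the ergonomic objective the execution module actually optimizes.

```python
import math
from typing import Optional, Tuple


def forward_kinematics(q1: float, q2: float, l1: float, l2: float) -> Tuple[float, float]:
    """Hand position of a two-link planar arm with the shoulder at the origin.

    q1 is the shoulder angle and q2 the elbow angle (radians); l1 and l2 are
    the upper-arm and forearm lengths (hypothetical values, in metres).
    """
    x = l1 * math.cos(q1) + l2 * math.cos(q1 + q2)
    y = l1 * math.sin(q1) + l2 * math.sin(q1 + q2)
    return x, y


def inverse_kinematics(x: float, y: float, l1: float, l2: float) -> Optional[Tuple[float, float]]:
    """Joint angles (q1, q2) that place the hand at (x, y), or None if unreachable.

    Returns the solution with q2 >= 0; which physical elbow configuration this
    corresponds to depends on the chosen frame (an assumption of this sketch).
    """
    cos_q2 = (x * x + y * y - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    if cos_q2 < -1.0 or cos_q2 > 1.0:
        return None  # target lies outside the arm's reachable annulus
    q2 = math.acos(cos_q2)
    q1 = math.atan2(y, x) - math.atan2(l2 * math.sin(q2), l1 + l2 * math.cos(q2))
    return q1, q2
```

A handover-pose optimizer could use `inverse_kinematics` to reject candidate handover points the person cannot reach and to score the remaining ones by how far the resulting joint angles are from a comfortable posture; the specific comfort measure is left open here because the summary does not specify it.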
Abstract
Human-robot interaction (HRI) encompasses a wide range of collaborative tasks, with handover being one of the most fundamental. As robots become more integrated into human environments, the potential for service robots to assist in handing objects to humans is increasingly promising. In robot-to-human (R2H) handover, selecting the optimal grasp is crucial for success, as it requires avoiding interference with the human's preferred grasp region and minimizing intrusion into their workspace. Existing methods either inadequately consider geometric information or rely on data-driven approaches, which often struggle to generalize across diverse objects. To address these limitations, we propose a novel zero-shot system that combines semantic and geometric information to generate optimal handover grasps. Our method first identifies grasp regions using semantic knowledge from vision-language models (VLMs) and, by incorporating customized visual prompts, achieves finer granularity in region grounding. A grasp is then selected based on grasp distance and approach angle to maximize human ease and avoid interference. We validate our approach through ablation studies and real-world comparison experiments. Results demonstrate that our system improves handover success rates and provides a more user-preferred interaction experience. Videos, appendices, and more are available at https://sites.google.com/view/vlm-handover/.
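The selection step described above, choosing a grasp by distance and approach angle relative to the human grasp region, can be sketched as a simple scoring function. This is an illustrative sketch under stated assumptions, not the paper's scoring rule: the `Grasp` type, the weights `w_dist` and `w_angle`, the linear combination, and the convention that a grasp is better when it is farther from the human region and its approach direction points toward that region (so the gripper sits on the opposite side) are all hypothetical.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Grasp:
    position: Vec3  # grasp contact point on the object (metres)
    approach: Vec3  # gripper approach direction, pointing from gripper to object


def score_grasp(grasp: Grasp, human_center: Vec3,
                w_dist: float = 1.0, w_angle: float = 0.1) -> float:
    """Higher is better. The weights are hypothetical tuning parameters."""
    offset = tuple(h - p for h, p in zip(human_center, grasp.position))
    dist = math.hypot(*offset)                # distance to the human region
    a_norm = math.hypot(*grasp.approach)
    if dist == 0.0 or a_norm == 0.0:
        cos_angle = 0.0                       # angle undefined; treat as neutral
    else:
        dot = sum(a * o for a, o in zip(grasp.approach, offset))
        cos_angle = dot / (a_norm * dist)     # 1.0 when approach points at the region
    return w_dist * dist + w_angle * cos_angle


def rank_grasps(grasps: Sequence[Grasp], human_center: Vec3) -> List[Grasp]:
    """Return candidates sorted from most to least preferred."""
    return sorted(grasps, key=lambda g: score_grasp(g, human_center), reverse=True)
```

For example, with the human region centered at `(0.0, 0.0, 0.10)`, a candidate at `(0.0, 0.0, 0.02)` approaching along `+z` ranks above a candidate at `(0.0, 0.0, 0.09)` approaching along `-z`, because the first is farther from the human region and leaves it unobstructed.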
