Leveraging Semantic and Geometric Information for Zero-Shot Robot-to-Human Handover

Jiangshan Liu, Wenlong Dong, Jiankun Wang, Max Q. -H. Meng

TL;DR

This work tackles zero-shot robot-to-human handover by fusing semantic grounding from vision-language models with geometric constraints to determine optimal handover grasps. It introduces three modules: region grounding using Set-of-Mark prompted VLMs and SAM-based segmentation to define human and robot grasp regions; a grasp selection mechanism that generates diverse candidates with Contact-GraspNet and ranks them by distance and approach angle relative to the human region; and an execution module that optimizes handover pose for ergonomic human interaction using a two-joint planar arm model. Ablation studies show that both semantic region grounding and geometric ranking improve handover success, while real-world experiments and a user study demonstrate higher success rates and more user-preferred handovers compared to baselines like AffNet-DR and LAN-grasp. The approach advances zero-shot manipulation in HRI by leveraging foundation-model grounding with geometry-aware constraints to enhance generalization across objects and improve practical usability in human environments.
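The ranking step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the summary does not give the scoring formula, so the weighted sum, the weights `w_dist` and `w_angle`, the `GraspCandidate` type, and the exact definition of the approach angle (here, the angle between the direction from the grasp point back toward the gripper and the direction from the grasp point to the centre of the human grasp region) are all assumptions made for illustration. The candidates stand in for the output of a grasp generator such as Contact-GraspNet.

```python
import math
from dataclasses import dataclass

Vec3 = tuple[float, float, float]


@dataclass
class GraspCandidate:
    """Hypothetical container for one grasp proposal."""
    position: Vec3  # grasp point on the object, in metres
    approach: Vec3  # direction in which the gripper moves toward the object


def _sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _norm(v: Vec3) -> float:
    return math.sqrt(_dot(v, v))


def approach_angle(grasp: GraspCandidate, human_center: Vec3) -> float:
    """Angle in radians between the gripper side and the human region.

    Assumed definition: a larger angle means the gripper body sits
    farther away from the region the human is expected to grasp.
    """
    to_human = _sub(human_center, grasp.position)
    toward_gripper = (-grasp.approach[0], -grasp.approach[1], -grasp.approach[2])
    denom = _norm(to_human) * _norm(toward_gripper)
    if denom == 0.0:
        return 0.0  # degenerate input: treat as the least favourable angle
    cos_angle = max(-1.0, min(1.0, _dot(to_human, toward_gripper) / denom))
    return math.acos(cos_angle)


def rank_grasps(candidates, human_center, w_dist=0.5, w_angle=0.5):
    """Sort candidates best-first by an assumed weighted score.

    Both terms are normalised to [0, 1]; a larger distance from the
    human region and a larger approach angle are both rewarded.
    """
    if not candidates:
        return []
    dists = [_norm(_sub(g.position, human_center)) for g in candidates]
    max_dist = max(dists) or 1.0  # avoid division by zero
    scored = []
    for grasp, dist in zip(candidates, dists):
        score = (w_dist * dist / max_dist
                 + w_angle * approach_angle(grasp, human_center) / math.pi)
        scored.append((score, grasp))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [grasp for _, grasp in scored]
```

Under these assumptions, a grasp that is far from the human region and approached from the opposite side ranks above a grasp that is close to the human region and approached from the human's side. An actual system would also need to restrict candidates to the robot grasp region before ranking.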

Abstract

Human-robot interaction (HRI) encompasses a wide range of collaborative tasks, with handover being one of the most fundamental. As robots become more integrated into human environments, the potential for service robots to assist in handing objects to humans is increasingly promising. In robot-to-human (R2H) handover, selecting the optimal grasp is crucial for success, as it requires avoiding interference with the human's preferred grasp region and minimizing intrusion into their workspace. Existing methods either inadequately consider geometric information or rely on data-driven approaches, which often struggle to generalize across diverse objects. To address these limitations, we propose a novel zero-shot system that combines semantic and geometric information to generate optimal handover grasps. Our method first identifies grasp regions using semantic knowledge from vision-language models (VLMs) and, by incorporating customized visual prompts, achieves finer granularity in region grounding. A grasp is then selected based on grasp distance and approach angle to maximize human ease and avoid interference. We validate our approach through ablation studies and real-world comparison experiments. Results demonstrate that our system improves handover success rates and provides a more user-preferred interaction experience. Videos, appendixes and more are available at https://sites.google.com/view/vlm-handover/.

Paper Structure

This paper contains 14 sections, 5 equations, 5 figures, and 3 tables.

Figures (5)

  • Figure 1: (a) Successful robot-to-human handovers share a common pattern: humans tend to grasp regions that are conducive to the object's intended function. (b) A successful handover grasp takes into account the robot's grasp region and approach direction. Red and green masks are the predicted regions where robots and humans grasp. (c) A grasp fails in handover because the robot grasps the region that humans prefer to grasp. (d) A grasp fails in handover because the robot approaches from an inappropriate direction and intrudes into the human's workspace.
  • Figure 2: Overview of our proposed system.
  • Figure 3: Qualitative results of our method in real-world experiments. Each row visualizes the intermediate results, with the real-world robot execution shown in the last column.
  • Figure 4: Results of our user study in comparison experiments. The horizontal axis represents different items, while the vertical axis shows the average Likert ratings. Three methods are distinguished by different colors.
  • Figure 5: Examples from real-world robot experiments. Compared with the other two methods, our method selects grasps in the proper region with larger approach angles and leaves more space for the human during handover.