A Generative System for Robot-to-Human Handovers: from Intent Inference to Spatial Configuration Imagery
Hanxin Zhang, Abdulqader Dhafer, Zhou Daniel Hao, Hongbiao Dong
TL;DR
This work tackles robot-to-human handovers by jointly inferring human intent from text and a receiving hand image and by imagining a spatial handover configuration using diffusion-based generation conditioned on task prompts. It combines multimodal large language models for intent recognition with a diffusion-based generator (Text2HOI) and a grasp-angle predictor (Multi-GraspLLM), followed by a pose-matching step to align imagined and real hands. In both simulation and on a UR5e platform, the approach yields natural, spatially coherent handover configurations with safety considerations (minimal contact) and demonstrates robustness across varying levels of input ambiguity. The study advances collaborative robotics by integrating perception, language, and generative geometry to support fluent, safe handovers, while highlighting future needs for real-time adaptation and broader generalization.
Abstract
We propose a novel system for robot-to-human object handover that emulates human coworker interactions. Unlike most existing studies, which focus primarily on grasping strategies and motion planning, our system focuses on (1) inferring human handover intent and (2) imagining the spatial handover configuration. The first component integrates multimodal perception, combining visual and verbal cues, to infer human intent. The second uses a diffusion-based model to generate the handover configuration, that is, the spatial relationship among the robot's gripper, the object, and the human hand, thereby mimicking the cognitive process of motor imagery. Experimental results demonstrate that our approach effectively interprets human cues and achieves fluent, human-like handovers, offering a promising solution for collaborative robotics. Code, videos, and data are available at: https://i3handover.github.io.
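As an illustration of the pose-matching step mentioned in the TL;DR, the sketch below aligns an imagined hand with an observed hand by a least-squares rigid fit (the Kabsch algorithm) and then carries the imagined gripper pose into the real frame. This is a minimal sketch under stated assumptions, not the method used in this work: it assumes both hands are available as corresponding 3D keypoints, and the names `rigid_align` and `to_real_frame` are hypothetical.

```python
import numpy as np


def rigid_align(imagined, observed):
    """Least-squares rigid fit (Kabsch): find R, t so that R @ p + t ~ q.

    imagined, observed: (N, 3) arrays of corresponding hand keypoints
    (the keypoint representation is an assumption of this sketch).
    """
    p = np.asarray(imagined, dtype=float)
    q = np.asarray(observed, dtype=float)
    mu_p, mu_q = p.mean(axis=0), q.mean(axis=0)

    # Cross-covariance of the centered point sets.
    h = (p - mu_p).T @ (q - mu_q)
    u, _, vt = np.linalg.svd(h)

    # Flip the last axis if needed so R is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    r = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    t = mu_q - r @ mu_p
    return r, t


def to_real_frame(r, t, gripper_pose_imagined):
    """Map a 4x4 gripper pose from the imagined frame into the real frame."""
    transform = np.eye(4)
    transform[:3, :3] = r
    transform[:3, 3] = t
    return transform @ np.asarray(gripper_pose_imagined, dtype=float)
```

Given the fitted `R` and `t`, `to_real_frame(R, t, T_gripper)` yields a candidate gripper target expressed relative to the observed hand. A rigid fit ignores differences in hand articulation, so the residual after alignment could serve as a rough check on how well the imagined hand matches the real one.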
