A Generative System for Robot-to-Human Handovers: from Intent Inference to Spatial Configuration Imagery

Hanxin Zhang, Abdulqader Dhafer, Zhou Daniel Hao, Hongbiao Dong

TL;DR

This work tackles robot-to-human handovers by jointly inferring human intent from text and a receiving hand image and by imagining a spatial handover configuration using diffusion-based generation conditioned on task prompts. It combines multimodal large language models for intent recognition with a diffusion-based generator (Text2HOI) and a grasp-angle predictor (Multi-GraspLLM), followed by a pose-matching step to align imagined and real hands. In both simulation and on a UR5e platform, the approach yields natural, spatially coherent handover configurations with safety considerations (minimal contact) and demonstrates robustness across varying levels of input ambiguity. The study advances collaborative robotics by integrating perception, language, and generative geometry to support fluent, safe handovers, while highlighting future needs for real-time adaptation and broader generalization.

Abstract

We propose a novel system for robot-to-human object handover that emulates human coworker interactions. Unlike most existing studies, which focus primarily on grasping strategies and motion planning, our system focuses on (1) inferring human handover intent and (2) imagining the spatial handover configuration. The first component integrates multimodal perception, combining visual and verbal cues, to infer human intent. The second uses a diffusion-based model to generate the handover configuration, which captures the spatial relationship among the robot's gripper, the object, and the human hand, thereby mimicking the cognitive process of motor imagery. Experimental results demonstrate that our approach effectively interprets human cues and achieves fluent, human-like handovers, offering a promising solution for collaborative robotics. Code, videos, and data are available at: https://i3handover.github.io.

Paper Structure

This paper contains 12 sections, 5 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: Our approach consists of two stages: handover intent inference and handover configuration imagery. The user conveys handover intent through text input and a receiving hand image. In the first stage, the robot infers a task description (e.g., “Pass the game controller to the right hand”) that serves as input to the next stage. Given the point cloud of the object named in the task description, the robot generates a receiving hand pose, estimates several candidate grasping angles for the gripper, and then determines the optimal handover pose.
  • Figure 2: Flowchart of handover configuration generation, integrating a text-guided diffusion model and an LLM. Both models take a 3D object point cloud and a task description as inputs. The CT-CVAE generates the contact map, while CLIP encodes the textual input into an embedding vector, providing conditional guidance for the diffusion model to predict the receiving hand pose. Simultaneously, Multi-GraspLLM generates multiple candidate grasp angles, which are then refined by the angle selection strategy to determine the optimal one. The final output is the most suitable spatial handover configuration among the object, receiving hand, and robotic gripper.
  • Figure 3: MLLM interpretation of intent. The MLLM processes textual and visual inputs through a structured prompt template to generate a task description.
  • Figure 4: Hand coordinate matching. The robot generates an imagined handover configuration. This imagined receiving hand is matched with the actual receiving hand to ensure proper spatial consistency.
  • Figure 5: Experimental setup. The tests were conducted both in simulation and on real hardware.
  • ...and 3 more figures
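
The hand coordinate matching described in the Figure 4 caption can be illustrated with a small frame-transfer sketch. This is not the authors' implementation: it assumes, purely for illustration, that the imagined hand, the imagined gripper, and the observed real hand are each available as a 4x4 homogeneous pose, and that matching means re-expressing the gripper pose relative to the imagined hand and then attaching it to the real hand. The names `transfer_gripper_pose`, `invert_rigid`, `matmul`, and `rot_z` are hypothetical helpers.

```python
import math


def matmul(a, b):
    """Multiply two 4x4 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def invert_rigid(t):
    """Invert a rigid transform [R | p] using [R^T | -R^T p]."""
    r_t = [[t[j][i] for j in range(3)] for i in range(3)]
    p = [t[i][3] for i in range(3)]
    new_p = [-sum(r_t[i][k] * p[k] for k in range(3)) for i in range(3)]
    return [r_t[i] + [new_p[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]


def rot_z(angle_rad, translation):
    """Build a pose that rotates about z and then translates (illustrative helper)."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    x, y, z = translation
    return [[c, -s, 0.0, x],
            [s, c, 0.0, y],
            [0.0, 0.0, 1.0, z],
            [0.0, 0.0, 0.0, 1.0]]


def transfer_gripper_pose(imagined_hand, imagined_gripper, real_hand):
    """Map the imagined gripper pose into the real scene.

    Assumption: imagined_* poses share the generator's frame, and
    real_hand is the observed hand pose in the robot base frame.
    """
    # Gripper pose expressed in the imagined hand's own frame.
    gripper_in_hand = matmul(invert_rigid(imagined_hand), imagined_gripper)
    # Re-attach that relative pose to the real hand.
    return matmul(real_hand, gripper_in_hand)


# Hypothetical poses: the imagined gripper sits 0.1 m along the hand's x axis.
imagined_hand = rot_z(math.pi / 2, (1.0, 0.0, 0.0))
imagined_gripper = rot_z(math.pi / 2, (1.0, 0.1, 0.0))
real_hand = rot_z(0.0, (0.5, 0.2, 0.3))
target = transfer_gripper_pose(imagined_hand, imagined_gripper, real_hand)
```

In this toy case the gripper target keeps the same offset from the real hand that it had from the imagined hand (0.1 m along the hand's x axis), which is the spatial consistency the caption refers to. A real system would also need to handle hand pose estimation noise and differences in hand shape, which this sketch ignores.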