MobileH2R: Learning Generalizable Human to Mobile Robot Handover Exclusively from Scalable and Diverse Synthetic Data

Zifan Wang; Ziqing Chen; Junyu Chen; Jilong Wang; Yuxin Yang; Yunze Liu; Xueyi Liu; He Wang; Li Yi

TL;DR

MobileH2R tackles generalizable human-to-mobile-robot handover in large workspaces by learning exclusively from synthetic data. It combines a scalable synthetic full-body motion pipeline, an automatic safe and imitation-friendly demonstration generator, and a 4D imitation-learning method to distill demonstrations into closed-loop base-arm policies, enabling effective sim2real transfer. The approach yields significant gains over baselines, with improved success rates and safety metrics, and demonstrates strong real-world performance on a mobile GALBOT platform. By validating across simulation and real-world tests, the work shows that high-quality synthetic data can replace real demonstrations for complex HRI tasks with mobile robots.

Abstract

This paper introduces MobileH2R, a framework for learning generalizable vision-based human-to-mobile-robot (H2MR) handover skills. Unlike traditional fixed-base handovers, this task requires a mobile robot to reliably receive objects in a large workspace enabled by its mobility. Our key insight is that generalizable handover skills can be developed in simulators using high-quality synthetic data, without the need for real-world demonstrations. To achieve this, we propose a scalable pipeline for generating diverse synthetic full-body human motion data, an automated method for creating safe and imitation-friendly demonstrations, and an efficient 4D imitation learning method for distilling large-scale demonstrations into closed-loop policies with base-arm coordination. Experimental evaluations in both simulators and the real world show significant improvements (at least +15% success rate) over baseline methods in all cases. Experiments also validate that large-scale and diverse synthetic data greatly enhances robot learning, highlighting our scalable framework.

Paper Structure (35 sections, 1 equation, 9 figures, 9 tables)

This paper contains 35 sections, 1 equation, 9 figures, 9 tables.

Introduction
Related Work
Human-to-Robot Handovers
Mobile Robot Manipulation
Scaling up Demonstrations for Imitation
Method
MobileH2R-Sim
Safe and Imitation-friendly Demonstration
Imitation for Coordinated Base-Arm Actions
Experiments
Evaluation on Different Methods
Evaluation on Data Scaling
Evaluation on Demonstration Strategies
Ablations
Real World Experiments
...and 20 more sections

Figures (9)

Figure 1: The overview of MobileH2R. We propose a framework for generalizable human-to-mobile-robot handover, including a scalable pipeline for diverse full-body human motion synthesis (a), an automatic method for producing safe, imitation-friendly demonstrations (b), an efficient 4D imitation learning approach to learn coordinated base-arm actions (c), and successful sim2real transfer (d).
Figure 2: The overview of our framework. First, we propose an automatic pipeline to scale up synthetic and diverse full-body motion data for the handover task by integrating various synthetic digital asset libraries, generative models, and useful toolkits. Second, we introduce an automatic pipeline to scale up mobile robot demonstrations for safety and imitation-friendliness. Our approach aims to avoid collisions while enhancing the vision-action correlation through carefully designed loss functions. Third, we employ a 4D imitation learning policy to learn 9D coordinated arm-base actions. We process point clouds of both objects and human bodies with a modified PointNet++.
Figure 3: Visualization of the vision neural loss. The Pose Prediction Network takes vision inputs and predicts the object pose. The prediction error is defined as the vision neural loss. The Vision-State Recovery Estimator takes states as input and estimates the vision neural loss, guiding the state-based trajectory optimization towards imitation-friendly demonstration generation.
Figure 4: Qualitative results. We compare different methods in detail in the simulated scene and the real-world scene.
Figure 5: Template prompt given to LLMs to generate direct object-aware motion descriptions for the controllable motion generator.
...and 4 more figures
