GenH2R: Learning Generalizable Human-to-Robot Handover via Scalable Simulation, Demonstration, and Imitation

Zifan Wang, Junyu Chen, Ziqing Chen, Pengwei Xie, Rui Chen, Li Yi

TL;DR

GenH2R tackles the problem of generalizing vision-based human-to-robot handovers to unseen object geometries and dynamic human motions. It introduces GenH2R-Sim for large-scale synthetic data, a distillation-friendly demonstration pipeline, and a forecast-aided 4D imitation learner that leverages 4D point-cloud observations to predict both actions and future object poses. Key contributions include a million-demo generation workflow, landmark-based planning to maintain vision-action correlation, and a joint action-prediction training objective that improves distillation efficiency. Experiments in simulators and in the real world show success-rate gains of at least 10% over baselines in all cases, and the learned policy can be deployed on a real robot.

Abstract

This paper presents GenH2R, a framework for learning generalizable vision-based human-to-robot (H2R) handover skills. The goal is to equip robots with the ability to reliably receive objects with unseen geometry handed over by humans in various complex trajectories. We acquire such generalizability by learning H2R handover at scale with a comprehensive solution including procedural simulation asset creation, automated demonstration generation, and effective imitation learning. We leverage large-scale 3D model repositories, dexterous grasp generation methods, and curve-based 3D animation to create an H2R handover simulation environment named GenH2R-Sim, surpassing the number of scenes in existing simulators by three orders of magnitude. We further introduce a distillation-friendly demonstration generation method that automatically generates a million high-quality demonstrations suitable for learning. Finally, we present a 4D imitation learning method augmented by a future forecasting objective to distill demonstrations into a visuo-motor handover policy. Experimental evaluations in both simulators and the real world demonstrate significant improvements (at least +10% success rate) over baselines in all cases. The project page is https://GenH2R.github.io/.

Paper Structure (35 sections, 3 equations, 9 figures, 7 tables)


Figures (9)

  • Figure 1: The overview of GenH2R. We introduce a framework for learning generalizable vision-based human-to-robot handover via scalable synthetic simulation, distillation-friendly expert demonstration generation, and a forecast-aided 4D imitation learning method. Our models demonstrate strong generalization capabilities to real datasets and can be deployed to a real robot.
  • Figure 2: The overview of our framework. First, we propose a new simulation environment named GenH2R-Sim, featuring large-scale synthetic datasets with diversity in object geometry, grasp poses, and complex trajectories. Second, in addition to destination planning (move straight toward the final position) and dense planning (replan at each step), we propose a distillation-friendly demonstration generation method, landmark planning, which predicts landmarks on the trajectory (as indicated by the dashed object above) and replans based on those landmarks. Third, our Forecast-aided 4D Imitation Learning leverages past flow information, and the forecasting objective enhances the exploitation of vision-action correlation.
  • Figure 3: Different demonstration generation methods for dynamic handover. The orange curve shows the hand-object trajectory. The blue, red, and green curves show the example trajectories generated by the foresighted planner, the shortsighted planner, and our planner, respectively.
  • Figure 4: Qualitative results. We compare different methods in detail in simulators and deploy them on a real-world platform.
  • Figure 5: Forecast-Aided 4D Imitation Learning Pipeline. The network receives egocentric point cloud input and produces egocentric 6D actions as output. For each input point, we compute its past coordinates using flow information obtained through the Iterative Closest Point algorithm. Subsequently, we employ PointNet++ to encode the processed point cloud into a low-dimensional global feature. The policy head decodes this feature into a 6D egocentric action, serving as the primary policy output. Simultaneously, the prediction head decodes the feature into future pose transformations, contributing to the auxiliary loss.
  • ...and 4 more figures
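
The Figure 5 caption describes two ideas that a small sketch can make concrete: attaching each point's past coordinates to the current point cloud, and training with a primary action loss plus an auxiliary forecasting loss. The code below is a minimal illustration under stated assumptions, not the authors' implementation. The function names, the mean-squared-error losses, the `aux_weight` parameter, and the flat pose representation are all hypothetical. The rigid transform that the caption obtains via Iterative Closest Point is passed in directly here, and the PointNet++ encoder and both heads are omitted.

```python
from typing import List, Sequence

Vec = Sequence[float]
Mat = Sequence[Sequence[float]]


def apply_rigid(rotation: Mat, translation: Vec, point: Vec) -> List[float]:
    """Apply the rigid transform x' = R x + t to one 3D point."""
    return [
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]


def augment_with_past(points: Sequence[Vec], rotation: Mat, translation: Vec) -> List[List[float]]:
    """Concatenate each current point with its past coordinates.

    Assumption: (rotation, translation) maps current coordinates to past
    coordinates. In the caption this flow comes from ICP; here it is given.
    """
    return [list(p) + apply_rigid(rotation, translation, p) for p in points]


def mse(pred: Vec, target: Vec) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(pred, target)) / len(pred)


def joint_loss(pred_action: Vec, expert_action: Vec,
               pred_future: Vec, true_future: Vec,
               aux_weight: float = 0.1) -> float:
    """Action imitation loss plus a weighted auxiliary forecasting loss.

    Assumption: both terms are MSE and are combined linearly; the weighting
    and loss forms used in the paper may differ.
    """
    return mse(pred_action, expert_action) + aux_weight * mse(pred_future, true_future)


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    cloud = [[1.0, 2.0, 3.0]]
    # Each point becomes 6 numbers: current xyz followed by past xyz.
    print(augment_with_past(cloud, identity, [0.1, 0.0, 0.0]))
    print(joint_loss([0.0] * 6, [0.0] * 6, [1.0, 1.0, 1.0], [0.0, 0.0, 0.0], aux_weight=0.5))
```

In this sketch the forecasting term only shapes the shared feature during training; at deployment only the action output would be used, which matches the caption's description of the prediction head as contributing to an auxiliary loss.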