HERMES: Human-to-Robot Embodied Learning from Multi-Source Motion Data for Mobile Dexterous Manipulation

Zhecheng Yuan, Tianming Wei, Langzhe Gu, Pu Hua, Tianhai Liang, Yuanpei Chen, Huazhe Xu

TL;DR

HERMES tackles the challenge of mobile bimanual dexterous manipulation by translating diverse one-shot human motions into robot policies through reinforcement learning. It integrates end-to-end depth-based sim2real transfer via DAgger distillation, a generalized object-centric reward design, and a hybrid sim2real control scheme, along with ViNT-based navigation and a closed-loop PnP localization module to bridge navigation and manipulation. The approach yields strong real-world performance, high sample efficiency, and robust zero-shot sim2real transfer across long-horizon tasks, outperforming non-learning baselines and demonstrating broad generalization in unstructured environments. This work provides a practical, scalable framework for leveraging multi-source human data to empower mobile dexterous manipulation with autonomous navigation capabilities. Overall, HERMES advances the deployment of complex manipulation policies in real-world settings by tightly integrating perception, learning, and navigation components.

Abstract

Leveraging human motion data to impart robots with versatile manipulation skills has emerged as a promising paradigm in robotic manipulation. Nevertheless, translating multi-source human hand motions into feasible robot behaviors remains challenging, particularly for robots equipped with multi-fingered dexterous hands characterized by complex, high-dimensional action spaces. Moreover, existing approaches often struggle to produce policies capable of adapting to diverse environmental conditions. In this paper, we introduce HERMES, a human-to-robot learning framework for mobile bimanual dexterous manipulation. First, HERMES formulates a unified reinforcement learning approach capable of seamlessly transforming heterogeneous human hand motions from multiple sources into physically plausible robotic behaviors. Subsequently, to mitigate the sim2real gap, we devise an end-to-end, depth image-based sim2real transfer method for improved generalization to real-world scenarios. Furthermore, to enable autonomous operation in varied and unstructured environments, we augment the navigation foundation model with a closed-loop Perspective-n-Point (PnP) localization mechanism, ensuring precise alignment of visual goals and effectively bridging autonomous navigation and dexterous manipulation. Extensive experimental results demonstrate that HERMES consistently exhibits generalizable behaviors across diverse, in-the-wild scenarios, successfully performing numerous complex mobile bimanual dexterous manipulation tasks. Project Page: https://gemcollector.github.io/HERMES/.

Paper Structure

This paper contains 44 sections, 9 equations, 24 figures, 9 tables, 3 algorithms.

Figures (24)

  • Figure 1: HERMES exhibits a rich spectrum of mobile bimanual dexterous manipulation skills. The robot is able to navigate over extended distances in both indoor and outdoor environments, and effectively execute a variety of complex manipulation tasks in unstructured, real-world scenarios, drawing upon behaviors learned from only one-shot human motion.
  • Figure 2: System Design. We construct a unified setup of mobile bimanual robots equipped with dexterous hands in both simulation and the real world. Through high-fidelity simulation, this robotic platform enables sim2real transfer across a wide range of complex manipulation tasks.
  • Figure 3: The main pipeline of HERMES. HERMES comprises a four-stage pipeline for achieving mobile bimanual dexterous manipulation through sim2real transfer. In stage 1, we acquire a one-shot human demonstration drawn from diverse sources. In stage 2, we train a state-based RL teacher policy and apply DAgger to distill it into a vision-based student policy. In stage 3, HERMES executes long-horizon navigation using ViNT, followed by closed-loop PnP to finely adjust the robot's pose and achieve precise alignment. Once localization is achieved, the student policy is deployed zero-shot directly in the real world.
  • Figure 4: Pose extraction from videos. We utilize FoundationPose to extract the pose trajectories of multiple objects and employ WiLoR to capture the poses of both hands along with the positions of their finger joints.
  • Figure 5: The visualization of hand motion trajectory. We utilize WiLoR along with a PnP algorithm to precisely transform the estimated hand poses into the robot’s frame.
  • ...and 19 more figures
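
The teacher-to-student distillation named in the Figure 3 caption (a state-based teacher distilled into a vision-based student with DAgger) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the HERMES implementation: the 1-D environment, the hand-coded `teacher_action` (standing in for the RL teacher), the noisy scalar `observe` (standing in for a depth image), and the one-parameter linear student are all hypothetical. Only the DAgger loop itself is the point: roll out a teacher/student mixture, label every visited state with the teacher's action, aggregate the data, and refit the student.

```python
import random


def teacher_action(state):
    """Hypothetical privileged teacher: sees the true state and drives it to 0."""
    return -0.5 * state


def observe(state, rng):
    """Hypothetical student observation: the state plus small sensor noise."""
    return state + rng.gauss(0.0, 0.01)


def fit_gain(dataset):
    """Least-squares fit of the student model: action = gain * observation."""
    num = sum(obs * act for obs, act in dataset)
    den = sum(obs * obs for obs, _ in dataset)
    return num / den if den > 0 else 0.0


def dagger(iterations=5, episodes=20, horizon=15, seed=0):
    """Toy DAgger loop; returns the learned student gain and dataset size."""
    rng = random.Random(seed)
    dataset = []   # aggregated (observation, teacher action) pairs
    gain = 0.0     # untrained student
    for it in range(iterations):
        # Probability of executing the teacher's action; decays over iterations
        # so later rollouts visit states induced by the student itself.
        beta = 1.0 if it == 0 else 0.5 ** it
        for _ in range(episodes):
            state = rng.uniform(-1.0, 1.0)
            for _ in range(horizon):
                obs = observe(state, rng)
                expert = teacher_action(state)
                dataset.append((obs, expert))  # always label with the teacher
                action = expert if rng.random() < beta else gain * obs
                state = state + action
        gain = fit_gain(dataset)  # retrain the student on all data so far
    return gain, len(dataset)


if __name__ == "__main__":
    gain, n = dagger()
    print(f"student gain after DAgger: {gain:.3f} ({n} labelled samples)")
```

In this sketch the student recovers a gain close to the teacher's -0.5 while only ever seeing noisy observations. In a real system the student would be a network over depth images and the teacher a policy trained with RL in simulation, as the caption describes.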