Learning to Transfer Human Hand Skills for Robot Manipulations

Sungjae Park; Seungho Lee; Mingi Choi; Jiye Lee; Jeonghwan Kim; Jisoo Kim; Hanbyul Joo

TL;DR

The paper tackles the problem of transferring dexterous human hand manipulation to robot hands across the embodiment gap. It learns a joint spatio-temporal manifold over object trajectories, human hand motions, and robot actions, trained with pseudo-ground-truth triplets synthesized from separate mocap and teleoperation data. A convolutional autoencoder encodes $(\textbf{O}, \textbf{H}, \textbf{R})$ into a latent code $\textbf{L}$, enabling inference of robot actions $\textbf{R}$ from given $\textbf{O}$ and $\textbf{H}$ via latent optimization, with an initial $\textbf{L}^{init}$ derived from a regression-based IK estimate. The proposed Hand-to-Robot Retargeting Model F and the synthetic data generation pipeline (Model S) outperform conventional retargeting techniques in real-world experiments on the Bottle, Bowl, and Book tasks, demonstrating improved physical plausibility, robustness to mocap noise, and generalization to unseen trajectories. This approach offers a scalable data-driven path for translating human manipulation into robotic dexterity by explicitly modeling hand-object interactions rather than relying solely on kinematic point-matching.
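The latent-optimization step can be illustrated with a toy sketch. This is not the paper's implementation: a random linear decoder stands in for the convolutional autoencoder, trajectories are flattened into short vectors, and the function names (`make_toy_decoder`, `infer_robot_action`), dimensions, loss, and step size are all hypothetical choices made for illustration. Under those assumptions, the sketch starts from a latent $\textbf{L}^{init}$ obtained by encoding $(\textbf{O}, \textbf{H})$ together with a rough IK-style guess for $\textbf{R}$, then refines $\textbf{L}$ by gradient descent so that the decoded $\textbf{O}$ and $\textbf{H}$ match the observations, and finally reads off the decoded $\textbf{R}$.

```python
import numpy as np


def make_toy_decoder(d_o, d_h, d_r, k, seed=0):
    """Hypothetical linear stand-in for a learned decoder.

    Returns W of shape (d_o + d_h + d_r, k) with orthonormal columns,
    so that W.T can serve as the matching (toy) encoder.
    """
    rng = np.random.default_rng(seed)
    W, _ = np.linalg.qr(rng.standard_normal((d_o + d_h + d_r, k)))
    return W


def infer_robot_action(W, O, H, R_init, d_o, d_h, steps=2000, lr=1.0):
    """Infer R from observed O and H by optimizing the latent code.

    Assumed objective: 0.5 * ||decode_OH(L) - [O, H]||^2, minimized by
    plain gradient descent starting from L_init.
    """
    W_oh = W[: d_o + d_h]   # rows that decode the object and hand parts
    W_r = W[d_o + d_h:]     # rows that decode the robot part
    target = np.concatenate([O, H])

    # L_init: encode the observations plus a rough guess for R
    # (the guess plays the role of a regression-based IK estimate).
    L = W.T @ np.concatenate([O, H, R_init])

    for _ in range(steps):
        residual = W_oh @ L - target
        L = L - lr * (W_oh.T @ residual)  # gradient step on the latent

    return W_r @ L, L


if __name__ == "__main__":
    d_o, d_h, d_r, k = 6, 10, 4, 3
    W = make_toy_decoder(d_o, d_h, d_r, k)
    rng = np.random.default_rng(1)
    x = W @ rng.standard_normal(k)          # a sample on the toy manifold
    O, H, R_true = x[:d_o], x[d_o:d_o + d_h], x[d_o + d_h:]
    R_init = R_true + 0.3 * rng.standard_normal(d_r)  # noisy initial guess
    R_hat, _ = infer_robot_action(W, O, H, R_init, d_o, d_h)
    print("initial error:", np.linalg.norm(R_init - R_true))
    print("refined error:", np.linalg.norm(R_hat - R_true))
```

In this toy setting the observed parts determine the latent uniquely, so the refined robot action recovers the one consistent with the manifold. With a nonlinear learned decoder the objective is no longer convex, which is one reason a good initial latent matters.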

Abstract

We present a method for teaching dexterous manipulation tasks to robots from human hand motion demonstrations. Unlike existing approaches that solely rely on kinematics information without taking into account the plausibility of robot and object interaction, our method directly infers plausible robot manipulation actions from human motion demonstrations. To address the embodiment gap between the human hand and the robot system, our approach learns a joint motion manifold that maps human hand movements, robot hand actions, and object movements in 3D, enabling us to infer one motion component from others. Our key idea is the generation of pseudo-supervision triplets, which pair human, object, and robot motion trajectories synthetically. Through real-world experiments with robot hand manipulation, we demonstrate that our data-driven retargeting method significantly outperforms conventional retargeting techniques, effectively bridging the embodiment gap between human and robotic hands. Website at https://rureadyo.github.io/MocapRobot/.

Paper Structure (10 sections, 8 equations, 7 figures, 3 tables)


Introduction
Related Work
Method
Synthesizing Pseudo-GT Triplet DB
Hardware System Setup for Data Collection
Evaluations
Experimental Setup
Synthetic Paired Dataset Generation Model S
Human-to-Robot Retargeting Model F
Discussion and Limitations

Figures (7)

Figure 1: We learn a human-to-robot retargeting model from unpaired human mocap and robot teleoperation datasets (i.e., the object may move differently in each).
Figure 2: Overview of the Proposed Framework. We first synthesize the paired triplet dataset, consisting of robot actions and human motions that achieve the same object trajectory, and then learn a retargeting module. The retargeting model is evaluated in the real world, and we use the Isaac Gym simulator for visualization only.
Figure 3: System Overview: Our system consists of 16 synchronized cameras, an xArm6 robot arm, and a 16-DoF Allegro robot hand.
Figure 4: Objects used in the experiment and a marker system for 3D tracking.
Figure 5: Visualization of the robot teleoperation dataset. Yellow: desired robot joint values. White: actual robot joint values. The dataset is collected in the real world, and the Isaac Gym simulator is used only for rendering.
...and 2 more figures
