From Human Hands to Robot Arms: Manipulation Skills Transfer via Trajectory Alignment
Han Zhou, Jinjin Cao, Liyuan Ma, Xueji Fang, Guo-jun Qi
TL;DR
This work tackles the data-efficiency bottleneck in robotic manipulation by proposing Traj2Action, a cross-embodiment transfer framework that uses the 3D trajectory of the operational endpoint as a unified intermediate representation. A coarse-to-fine policy is learned with a Trajectory Expert predicting a high-level trajectory and an Action Expert converting that plan into precise robot actions, trained via a joint denoising objective. By unifying human and robot demonstrations in the end-effector trajectory space and leveraging both human and robot data, Traj2Action achieves substantial gains over robot-only baselines, with up to +27% SR and +22.25% TP, and demonstrates improved data efficiency as human data scales, including notable zero-shot generalization to unseen goals. The approach significantly reduces reliance on costly robot demonstrations, enabling scalable, cross-embodiment skill transfer for real-world manipulation tasks on a Franka robot with a wrist-camera setup and SpaceMouse teleoperation.
Abstract
Learning diverse manipulation skills for real-world robots is severely bottlenecked by the reliance on costly and hard-to-scale teleoperated demonstrations. While human videos offer a scalable alternative, effectively transferring manipulation knowledge is fundamentally hindered by the significant morphological gap between human and robotic embodiments. To address this challenge and facilitate skill transfer from human to robot, we introduce Traj2Action, a novel framework that bridges this embodiment gap by using the 3D trajectory of the operational endpoint as a unified intermediate representation, and then transfers the manipulation knowledge embedded in this trajectory to the robot's actions. Our policy first learns to generate a coarse trajectory, which forms a high-level motion plan, by leveraging both human and robot data. This plan then conditions the synthesis of precise, robot-specific actions (e.g., orientation and gripper state) within a co-denoising framework. Extensive real-world experiments on a Franka robot demonstrate that Traj2Action improves performance by up to 27% and 22.25% over the $\pi_0$ baseline on short- and long-horizon real-world tasks, respectively, and achieves significant gains as human data scales in robot policy learning. Our project website, featuring code and video demonstrations, is available at https://anonymous.4open.science/w/Traj2Action-4A45/.
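
To make the co-denoising idea concrete, the following is a minimal sketch of how a joint denoising loss over mixed human and robot batches could be structured. It is an illustration under stated assumptions, not the authors' implementation: the function name `joint_denoising_loss`, the tensor shapes, the mean-squared-error form, the equal weighting of the two terms, and the choice to mask the action term for human samples (which carry no robot actions) are all hypothetical.

```python
import numpy as np

def joint_denoising_loss(traj_pred, traj_target, act_pred, act_target, is_robot):
    """Hypothetical joint loss for a trajectory expert and an action expert.

    traj_pred, traj_target: (B, T, 3) predicted / target denoising outputs
        for the 3D endpoint trajectory (available for human and robot data).
    act_pred, act_target:   (B, T, A) predicted / target denoising outputs
        for robot-specific actions (assumed meaningful for robot data only).
    is_robot:               (B,) float mask, 1.0 for robot samples, 0.0 for human.
    """
    # Per-sample trajectory error, supervised on every sample in the batch.
    traj_err = ((traj_pred - traj_target) ** 2).mean(axis=(1, 2))
    # Per-sample action error, later masked so human samples contribute nothing.
    act_err = ((act_pred - act_target) ** 2).mean(axis=(1, 2))

    traj_loss = traj_err.mean()
    n_robot = max(is_robot.sum(), 1)  # avoid division by zero on all-human batches
    act_loss = (act_err * is_robot).sum() / n_robot

    # Assumption: the two terms are summed with equal weight.
    return traj_loss + act_loss
```

In this sketch, human demonstrations shape only the trajectory term, while robot demonstrations shape both terms, which mirrors the abstract's description of a shared trajectory space with robot-specific action synthesis.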
