From Human Hands to Robot Arms: Manipulation Skills Transfer via Trajectory Alignment

Han Zhou, Jinjin Cao, Liyuan Ma, Xueji Fang, Guo-jun Qi

TL;DR

This work tackles the data-efficiency bottleneck in robotic manipulation by proposing Traj2Action, a cross-embodiment transfer framework that uses the 3D trajectory of the operational endpoint as a unified intermediate representation. A coarse-to-fine policy is learned with a Trajectory Expert predicting a high-level trajectory and an Action Expert converting that plan into precise robot actions, trained via a joint denoising objective. By unifying human and robot demonstrations in the end-effector trajectory space and leveraging both human and robot data, Traj2Action achieves substantial gains over robot-only baselines, with up to +27% SR and +22.25% TP, and demonstrates improved data efficiency as human data scales, including notable zero-shot generalization to unseen goals. The approach significantly reduces reliance on costly robot demonstrations, enabling scalable, cross-embodiment skill transfer for real-world manipulation tasks on a Franka robot with a wrist-camera setup and SpaceMouse teleoperation.

Abstract

Learning diverse manipulation skills for real-world robots is severely bottlenecked by the reliance on costly and hard-to-scale teleoperated demonstrations. While human videos offer a scalable alternative, effectively transferring manipulation knowledge is fundamentally hindered by the significant morphological gap between human and robotic embodiments. To address this challenge and facilitate skill transfer from human to robot, we introduce Traj2Action, a novel framework that bridges this embodiment gap by using the 3D trajectory of the operational endpoint as a unified intermediate representation, and then transfers the manipulation knowledge embedded in this trajectory to the robot's actions. Our policy first learns to generate a coarse trajectory, which forms a high-level motion plan by leveraging both human and robot data. This plan then conditions the synthesis of precise, robot-specific actions (e.g., orientation and gripper state) within a co-denoising framework. Extensive real-world experiments on a Franka robot demonstrate that Traj2Action improves performance by up to 27% and 22.25% over the $\pi_0$ baseline on short- and long-horizon real-world tasks, and achieves significant gains as human data scales in robot policy learning. Our project website, featuring code and video demonstrations, is available at https://anonymous.4open.science/w/Traj2Action-4A45/.
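
The coarse-to-fine idea in the abstract can be illustrated with a small toy sketch. It is not the authors' implementation: the two "experts" below are hand-written stand-ins for learned networks, and the flow-matching-style Euler integration, the constants `HORIZON` and `NUM_STEPS`, and all function names are assumptions made for illustration only. The point it shows is the control flow: a trajectory and an action sequence start as noise and are denoised together, with each action update conditioned on the current trajectory estimate.

```python
import random

HORIZON = 4      # hypothetical number of waypoints / action steps
NUM_STEPS = 10   # hypothetical number of denoising (Euler) steps


def trajectory_expert(traj, t, goal):
    """Toy stand-in for a learned network: velocity toward a straight-line path to `goal`."""
    velocity = []
    for i, point in enumerate(traj):
        frac = (i + 1) / len(traj)
        velocity.append([(frac * g - p) / (1.0 - t) for g, p in zip(goal, point)])
    return velocity


def action_expert(actions, t, traj):
    """Toy stand-in: velocity toward the actions implied by the current trajectory estimate."""
    velocity = []
    prev = [0.0, 0.0, 0.0]
    for i, (act, point) in enumerate(zip(actions, traj)):
        delta_d = [p - q for p, q in zip(point, prev)]   # translation between waypoints
        delta_theta = [0.0, 0.0, 0.0]                     # no rotation in this toy task
        grip = [1.0 if i == len(traj) - 1 else 0.0]       # close the gripper at the last step
        target = delta_d + delta_theta + grip
        velocity.append([(tg - a) / (1.0 - t) for tg, a in zip(target, act)])
        prev = point
    return velocity


def euler_step(x, v, dt):
    """One explicit Euler update applied element-wise to a list of vectors."""
    return [[xi + dt * vi for xi, vi in zip(rx, rv)] for rx, rv in zip(x, v)]


def co_denoise(goal, seed=0):
    """Jointly denoise a 3D trajectory and a 7-D action sequence, both starting from noise."""
    rng = random.Random(seed)
    traj = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(HORIZON)]
    actions = [[rng.gauss(0, 1) for _ in range(7)] for _ in range(HORIZON)]
    dt = 1.0 / NUM_STEPS
    for k in range(NUM_STEPS):
        t = k * dt
        # Coarse plan first, then the action update conditioned on that plan.
        traj = euler_step(traj, trajectory_expert(traj, t, goal), dt)
        actions = euler_step(actions, action_expert(actions, t, traj), dt)
    return traj, actions


traj, actions = co_denoise(goal=[0.4, 0.0, 0.2])
print("last waypoint:", [round(v, 3) for v in traj[-1]])
print("last action:  ", [round(v, 3) for v in actions[-1]])
```

In this toy, the trajectory expert needs only 3D endpoint positions, which is the kind of signal that could in principle come from either human or robot demonstrations, while the action expert handles the robot-specific outputs (rotation and gripper state).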

Paper Structure

This paper contains 27 sections, 3 equations, 9 figures, 4 tables, 2 algorithms.

Figures (9)

  • Figure 1: An overview of the Traj2Action framework. Given multi-view images and a language instruction, the model operates in a coarse-to-fine manner. A Trajectory Expert, trained on both human and robot data, first predicts a coarse 3D trajectory plan. This high-level plan then conditions an Action Expert to generate fine-grained robot actions, which include precise translation ($\Delta d$), rotation ($\Delta \theta$), and gripper state ($\Delta \mathrm{Grip}$). Both experts are optimized jointly within a co-denoising framework, enabling the coarse trajectory to guide the synthesis of fine-grained actions.
  • Figure 2: Illustration of our data collection systems for human hand motion (left) and robot teleoperation (right).
  • Figure 3: Visual illustration of four real-world tasks on a Franka Research 3 robot.
  • Figure 4: Visual comparison of trajectory and action prediction on a short-horizon task, pick up the tomato and put it in the tray (top), and a long-horizon task, stack the rings on the pillar (bottom).
  • Figure 5: Impact of Human Data Scale on Policy Performance. The chart displays the performance on the pick up the tomato and put it in the tray task (left) and the stack the paper cups task (right) as a function of the number of human demonstrations used in training.
  • ...and 4 more figures