Trajectory Conditioned Cross-embodiment Skill Transfer

YuHang Tang, Yixuan Lou, Pengfei Han, Haoming Song, Xinyi Ye, Dong Wang, Bin Zhao

TL;DR

TrajSkill addresses the challenge of transferring manipulation skills from human demonstration videos to robots with different morphologies. It introduces an embodiment-agnostic representation based on sparse optical flow trajectories and a two-stage trajectory-conditioned diffusion framework that generates robot motion videos, which are then translated into executable actions, enabling zero-shot cross-embodiment imitation without paired data or reinforcement learning. Extensive experiments on MetaWorld and real kitchen tasks demonstrate significant improvements in video realism metrics (FVD, KVD) and cross-embodiment success rates, as well as robust real-robot performance. The approach offers a scalable pathway for learning from unstructured human videos across diverse robot morphologies and tasks, with future work toward longer-horizon tasks and language-grounded task specifications.

Abstract

Learning manipulation skills from human demonstration videos is a promising yet challenging problem, primarily because of the significant embodiment gap between the human body and robot manipulators. Existing methods rely on paired datasets or hand-crafted rewards, which limit scalability and generalization. We propose TrajSkill, a framework for Trajectory Conditioned Cross-embodiment Skill Transfer that enables robots to acquire manipulation skills directly from human demonstration videos. Our key insight is to represent human motions as sparse optical flow trajectories, which serve as embodiment-agnostic motion cues by removing morphological variations while preserving essential dynamics. Conditioned on these trajectories together with visual and textual inputs, TrajSkill jointly synthesizes temporally consistent robot manipulation videos and translates them into executable actions, thereby achieving cross-embodiment skill transfer. In simulation (MetaWorld), TrajSkill reduces FVD by 39.6% and KVD by 36.6% compared with the state of the art, and improves cross-embodiment success rate by up to 16.7%. Real-robot experiments on kitchen manipulation tasks further validate the approach, demonstrating practical human-to-robot skill transfer across embodiments.

Paper Structure

This paper contains 24 sections, 9 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Overview of TrajSkill, from human demonstration to robot action. TrajSkill leverages sparse optical flow as a universal motion representation, achieving zero-shot imitation without reinforcement learning or paired datasets. In Stage 1, dense optical flow is extracted from human demonstrations and sampled into sparse optical flow to guide video generation. In Stage 2, the generated video is translated into robot actions using a learned policy, enabling the robot to mimic the demonstrated task.
  • Figure 2: Unified illustration of the TrajSkill framework. Top: Embodiment-Invariant Flow Sampling. From a human demonstration video frame (left), dense optical flow is computed by RAFT (Teed and Deng, 2020) (middle), and sparse keypoint trajectories are sampled according to the flow magnitude and propagated over time (right). Middle and Bottom: Overview of the Trajectory Conditioned Robot Execution. Given a task description, the T5 model interprets the instruction, a 3D VAE extracts spatial features, and the trajectory extractor provides sparse flow signals. These are fused within a Diffusion Transformer to predict robot motion videos, which are then decoded by the policy $p(a|o,v)$ into executable actions.
  • Figure 3: Trajectory conditioned video generation. The robot arm is provided with an initial frame and a predefined trajectory, shown as red curves. TrajSkill generates a sequence of motion frames where the robot follows the specified path. The figure illustrates the robotic arm at both the starting and ending points of the trajectory.
  • Figure 4: Trajectory conditioned cross-embodiment skill transfer. Top two rows show simulation results where human demonstrations are abstracted as spherical trajectories (first row) to guide robotic arm motion generation (second row). Bottom four rows demonstrate real-world transfer from human hand demonstrations to robotic arm execution for complex multi-step tasks.
  • Figure 5: Human-Controlled Robot Video Prediction for Pick and Place Tasks. Human demonstrations (left) control robot arm movements in predicted videos (right) at three different banana positions: 20 cm, 30 cm, and 40 cm from the basket.
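
The flow sampling step described in the Figure 2 caption (sample keypoints by flow magnitude, then propagate them over time) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the dense flow has already been computed by some estimator such as RAFT, selects the top-k pixels by magnitude in the first frame only, and uses nearest-pixel lookup when propagating. The function name `sample_sparse_trajectories` and its parameters `num_points` and `min_magnitude` are hypothetical.

```python
import numpy as np


def sample_sparse_trajectories(flows, num_points=8, min_magnitude=1.0):
    """Sample sparse point trajectories from precomputed dense optical flow.

    flows: array of shape (T, H, W, 2); flows[t, y, x] is the (dx, dy)
           displacement of pixel (x, y) from frame t to frame t + 1.
    Returns an array of shape (N, T + 1, 2) holding (x, y) positions,
    where N <= num_points.
    """
    flows = np.asarray(flows, dtype=np.float64)
    num_steps, height, width, _ = flows.shape

    # Assumption: keypoints are chosen once, from the first flow field,
    # as the pixels with the largest motion magnitude.
    magnitude = np.linalg.norm(flows[0], axis=-1)
    top = np.argsort(magnitude, axis=None)[::-1][:num_points]
    ys, xs = np.unravel_index(top, (height, width))

    # Drop near-static pixels so background points are not tracked.
    keep = magnitude[ys, xs] >= min_magnitude
    points = np.stack([xs[keep], ys[keep]], axis=-1).astype(np.float64)

    trajectories = [points.copy()]
    for t in range(num_steps):
        # Nearest-pixel lookup of the flow at each current point position.
        xi = np.clip(np.rint(points[:, 0]).astype(int), 0, width - 1)
        yi = np.clip(np.rint(points[:, 1]).astype(int), 0, height - 1)
        points = points + flows[t, yi, xi]
        trajectories.append(points.copy())

    return np.stack(trajectories, axis=1)
```

Because the resulting trajectories record only where moving points go, not what the moving body looks like, they are the kind of signal the paper treats as embodiment-agnostic. A fuller version would likely use bilinear interpolation of the flow and some spatial spreading of the sampled points; both are omitted here for brevity.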