XSkill: Cross Embodiment Skill Discovery

Mengda Xu, Zhenjia Xu, Cheng Chi, Manuela Veloso, Shuran Song

TL;DR

XSkill tackles cross-embodiment imitation by learning a shared skill space with prototypes from unlabeled human and robot videos, then transfers via a diffusion policy and composes unseen tasks from a one-shot prompt video using a Skill Alignment Transformer. The approach hinges on Sinkhorn-based prototype clustering and time-contrastive learning to align skills across embodiments. Evaluations in simulated and real kitchens show strong generalization to unseen task compositions and robustness to speed differences, outperforming baselines. Limitations include dependence on the number of prototypes and data diversity; future work aims to broaden datasets and camera setups.

Abstract

Human demonstration videos are a widely available data source for robot learning and an intuitive user interface for expressing desired behavior. However, directly extracting reusable robot manipulation skills from unstructured human videos is challenging due to the large embodiment difference and unobserved action parameters. To bridge this embodiment gap, this paper introduces XSkill, an imitation learning framework that 1) discovers a cross-embodiment representation called skill prototypes purely from unlabeled human and robot manipulation videos, 2) transfers the skill representation to robot actions using a conditional diffusion policy, and finally, 3) composes the learned skills to accomplish unseen tasks specified by a human prompt video. Our experiments in simulation and real-world environments show that the discovered skill prototypes facilitate both skill transfer and composition for unseen tasks, resulting in a more general and scalable imitation learning framework. The benchmark, code, and qualitative results are available at https://xskill.cs.columbia.edu/

Paper Structure

This paper contains 22 sections, 2 equations, 6 figures, 9 tables, and 3 algorithms.

Figures (6)

  • Figure 1: Cross Embodiment Skill Discovery. XSkill first learns a cross-embodiment skill representation space (XSkill Space on the left). During inference, given a human demonstration of unseen tasks, XSkill first identifies the human skills by projecting the video demonstration onto the learned cross-embodiment skill representation space. The identified skills are then executed by the skill-conditioned visuomotor policy.
  • Figure 2: XSkill Discover: At each training iteration, a batch of videos is sampled from the same embodiment dataset. Each video $v^t_i$ is augmented into two versions and encoded using the temporal encoder $f_\textrm{temporal}$. The learnable skill prototypes $f_\textrm{prototype}$ are implemented as a normalized linear layer without bias. Both $f_\textrm{temporal}$ and $f_\textrm{prototype}$ are trained jointly to minimize the cross-entropy loss between the predicted and target probabilities over the skill prototypes. Sinkhorn regularization is applied to the target probability, ensuring all prototypes are used for each batch (same embodiment), thereby encouraging prototype sharing across embodiments.
  • Figure 3: Transfer & Composition: During inference, given a human demonstration of a new task, XSkill first extracts a sequence of skills, which can be viewed as a high-level task plan. However, this plan is not immediately aligned with the robot's execution speed due to the embodiment gap. The plan is therefore aligned with the robot's current observation by the Skill Alignment Transformer. The inferred skills are then passed into a skill-conditioned diffusion policy to obtain the robot's actions.
  • Figure 4: Evaluation Environments.
  • Figure 5: XSkill embedding. (a) We utilize t-SNE visualization to showcase the alignment of skill representations among various embodiments when in contact with the same object. (b) We present projected prototypes for both humans and robots executing identical tasks. XSkill achieves efficient alignment of representations, not just during physical contact, but also during the transition between manipulating different objects.
  • ...and 1 more figures
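
The prototype objective described in the Figure 2 caption can be illustrated with a small sketch. This is not the authors' implementation: it assumes a SwAV-style swapped prediction between the two augmented views (the caption does not state how the views are paired), and the function names and the hyperparameters `epsilon`, `temperature`, and `n_iters` are hypothetical. It is written in plain Python so it runs without dependencies; a real implementation would more likely use an array library.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def prototype_scores(embeddings, prototypes):
    """Cosine similarity of each embedding to each prototype (B x K).

    Mirrors a "normalized linear layer without bias": the prototypes act
    as the rows of the weight matrix, and both sides are unit-normalized.
    """
    z = [l2_normalize(e) for e in embeddings]
    c = [l2_normalize(p) for p in prototypes]
    return [[sum(a * b for a, b in zip(zi, ck)) for ck in c] for zi in z]


def softmax(scores, temperature):
    """Predicted probability over prototypes for one sample."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def sinkhorn(scores, epsilon=0.05, n_iters=3):
    """Soft targets whose mass is spread roughly evenly over prototypes.

    Alternates column and row normalization (Sinkhorn-Knopp). epsilon and
    n_iters are assumed values, not taken from the paper.
    """
    batch, n_proto = len(scores), len(scores[0])
    top = max(max(row) for row in scores)
    q = [[math.exp((s - top) / epsilon) for s in row] for row in scores]
    for _ in range(n_iters):
        # Column step: every prototype receives total mass 1 / n_proto.
        col = [sum(q[i][k] for i in range(batch)) for k in range(n_proto)]
        q = [[q[i][k] / (col[k] * n_proto) for k in range(n_proto)]
             for i in range(batch)]
        # Row step: every sample carries total mass 1 / batch.
        row = [sum(r) for r in q]
        q = [[v / (row[i] * batch) for v in r] for i, r in enumerate(q)]
    # Rescale so each sample's target distribution sums to 1.
    return [[v * batch for v in r] for r in q]


def swapped_prediction_loss(scores_a, scores_b, temperature=0.1):
    """Cross-entropy between one view's prediction and the other's target."""
    q_a, q_b = sinkhorn(scores_a), sinkhorn(scores_b)
    total = 0.0
    for i in range(len(scores_a)):
        p_a = softmax(scores_a[i], temperature)
        p_b = softmax(scores_b[i], temperature)
        total -= sum(t * math.log(p) for t, p in zip(q_b[i], p_a))
        total -= sum(t * math.log(p) for t, p in zip(q_a[i], p_b))
    return total / (2 * len(scores_a))


# Toy usage: 4 clip embeddings per view, 2 prototypes, 3-D features.
prototypes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
view_a = [[0.9, 0.1, 0.0], [0.8, 0.3, 0.1], [0.1, 0.9, 0.2], [0.2, 0.7, 0.1]]
view_b = [[1.0, 0.2, 0.1], [0.7, 0.2, 0.0], [0.0, 1.0, 0.1], [0.3, 0.8, 0.0]]
loss = swapped_prediction_loss(prototype_scores(view_a, prototypes),
                               prototype_scores(view_b, prototypes))
print(f"swapped prediction loss: {loss:.4f}")
```

Because the Sinkhorn targets are computed per batch and each batch comes from a single embodiment, the balancing step pushes human batches and robot batches to each spread over the full prototype set, which is the mechanism the caption credits for prototype sharing across embodiments.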