X-Diffusion: Training Diffusion Policies on Cross-Embodiment Human Demonstrations

Maximus A. Pace, Prithwish Dan, Chuanruo Ning, Atiksh Bhardwaj, Audrey Du, Edward W. Duan, Wei-Chiu Ma, Kushal Kedia

TL;DR

X-Diffusion tackles the problem of using large-scale cross-embodiment human demonstrations to train diffusion policies without learning dynamically infeasible robot motions. It trains a classifier on noised actions to find, for each human action, the minimum diffusion step $k^\star$ at which the action can no longer be distinguished from a robot action, and it includes that human action in diffusion training only at steps $k \ge k^\star$. This preserves robot-feasible behavior at low noise levels while still drawing coarse guidance from human data. Across five manipulation tasks, X-Diffusion achieves a $16\%$ higher average success rate than the best baseline, which suggests that selective integration lets human demonstrations scale for robot manipulation without sacrificing physical feasibility.
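
The selection rule can be written out in standard DDPM notation. This is a sketch under assumptions: the paper's own equations are not reproduced in this summary, and the symbols $\bar\alpha_k$ (cumulative noise schedule), $c_\phi$ (classifier output, read as the estimated probability that a noised action came from a robot), and $\tau$ (decision threshold) are illustrative choices rather than the authors' notation.

$$
a^{k} = \sqrt{\bar\alpha_k}\, a^{0} + \sqrt{1 - \bar\alpha_k}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)
$$

$$
k^\star(a^{0}) = \min \{\, k : c_\phi(a^{k}, k) \ge \tau \,\}
$$

Under this reading, a human action $a^{0}$ contributes to the denoising loss only at steps $k \ge k^\star(a^{0})$, while robot actions contribute at every step. A human action that already resembles robot execution has a small $k^\star$ and supervises nearly the whole denoising chain; a strongly mismatched one has a large $k^\star$ and supervises only the high-noise steps.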

Abstract

Human videos can be recorded quickly and at scale, making them an appealing source of training data for robot learning. However, humans and robots differ fundamentally in embodiment, resulting in mismatched action execution. Direct kinematic retargeting of human hand motion can therefore produce actions that are physically infeasible for robots. Despite these low-level differences, human demonstrations provide valuable motion cues about how to manipulate and interact with objects. Our key idea is to exploit the forward diffusion process: as noise is added to actions, low-level execution differences fade while high-level task guidance is preserved. We present X-Diffusion, a principled framework for training diffusion policies that maximally leverages human data without learning dynamically infeasible motions. X-Diffusion first trains a classifier to predict whether a noisy action is executed by a human or robot. Then, a human action is incorporated into policy training only after adding sufficient noise such that the classifier cannot discern its embodiment. Actions consistent with robot execution supervise fine-grained denoising at low noise levels, while mismatched human actions provide only coarse guidance at higher noise levels. Our experiments show that naive co-training under execution mismatches degrades policy performance, while X-Diffusion consistently improves it. Across five manipulation tasks, X-Diffusion achieves a 16% higher average success rate than the best baseline. The project website is available at https://portal-cornell.github.io/X-Diffusion/.
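
The gating idea described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the linear noise schedule, the function names, the `threshold` value, and averaging the classifier over `n_samples` noise draws are all assumptions, and `robot_prob` stands in for whatever trained classifier is used.

```python
import numpy as np


def make_alpha_bar(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention terms for a linear DDPM schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def add_noise(action, k, alpha_bar, rng):
    """Forward diffusion: noise a clean action to step k (0-indexed)."""
    eps = rng.standard_normal(action.shape)
    noisy = np.sqrt(alpha_bar[k]) * action + np.sqrt(1.0 - alpha_bar[k]) * eps
    return noisy, eps


def min_indistinguishable_step(action, robot_prob, alpha_bar, rng,
                               threshold=0.5, n_samples=8):
    """Smallest step k at which the classifier's mean P(robot) reaches threshold.

    robot_prob(noisy_action, k) is a hypothetical classifier interface that
    returns the estimated probability that the noised action is a robot action.
    Returns len(alpha_bar) if the action is never mistaken for a robot action,
    which excludes it from training at every step.
    """
    for k in range(len(alpha_bar)):
        probs = [robot_prob(add_noise(action, k, alpha_bar, rng)[0], k)
                 for _ in range(n_samples)]
        if np.mean(probs) >= threshold:
            return k
    return len(alpha_bar)


def human_loss_mask(k, k_star):
    """1.0 if a human action may supervise denoising at step k, else 0.0."""
    return 1.0 if k >= k_star else 0.0
```

During training, a sampled diffusion step `k` for a human action would have its denoising loss multiplied by `human_loss_mask(k, k_star)`, while robot actions keep a weight of 1 at every step. Precomputing `k_star` once per human action, rather than querying the classifier at each training step, is one possible design choice; the source does not say which the authors use.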

Paper Structure

This paper contains 26 sections, 6 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Overview of X-Diffusion: We introduce X-Diffusion, a framework to train diffusion policies on cross-embodiment human data containing a variety of execution styles. Naively co-training diffusion policies on human and robot datasets with mismatched dynamics can lead the denoising process to output dynamically infeasible actions for the robot, degrading performance below standard robot-only diffusion policy training. Instead, X-Diffusion trains a classifier to distinguish between noised human and robot actions, and integrates noised human actions into policy training only when the classifier is unsure which embodiment produced them, thereby learning effectively from large and diverse human demonstrations.
  • Figure 2: Pipeline: X-Diffusion first unifies the state and action representation. State is represented by a colored segmentation mask of relevant objects using Grounded-SAM2 (Ravi et al., 2024). Action is represented via end-effector or human hand pose, using HaMeR (Pavlakos et al., 2023) for retargeting. During the policy's forward diffusion process, Gaussian noise is sampled and added to the clean actions. To determine whether the policy should learn to denoise noisy human actions into robot actions, X-Diffusion uses a classifier trained to distinguish the source embodiment of noised actions. Human actions are included in training the denoising process only if the classifier is fooled into thinking they come from a robot. Thus, we learn from broad human data without learning infeasible actions.
  • Figure 3: Visualizing Actions under Noise and Classifier Predictions at Various Diffusion Steps. Humans execute tasks in various ways. For example, when picking and placing a pan, a human can execute either a top-down grasp or a side grasp. Human actions that are feasible for robots (e.g. top-down grasp) overlap with the robot action distribution at low-noise timesteps. This data fools the classifier into believing it could have been executed by a robot, so we include it in the diffusion denoising process during policy training. In contrast, human actions that are kinematically and dynamically infeasible for robots (e.g. side grasp) are accurately identified as human actions by the classifier until significantly more noise is added in the forward diffusion process, which restricts their impact on policy learning to coarse guidance at high noise.
  • Figure 4: Performance vs. Baselines: We report task success rate on 5 different manipulation tasks and compare X-Diffusion against a robot-only baseline (Diffusion Policy) and various co-training baselines (Point-Policy, MotionTracks). DemoDiffusion is another diffusion-based method, but it doesn't train the robot policy on human demonstrations. We find that X-Diffusion is the highest performing model on all tasks, effectively incorporating human action data into its training recipe even when execution styles are mismatched. One human and robot demonstration is visualized for each task.
  • Figure 5: Naive co-training learns infeasible robot actions: Including all human data in policy training can incentivize policies to learn strategies demonstrated by humans but infeasible for robots. On multiple tasks, a human may manipulate objects in ways that are not realizable for a robot.
  • ...and 1 more figures