DemoDiffusion: One-Shot Human Imitation using pre-trained Diffusion Policy

Sungjae Park, Homanga Bharadhwaj, Shubham Tulsiani

Abstract

We propose DemoDiffusion, a simple method for enabling robots to perform manipulation tasks by imitating a single human demonstration, without requiring task-specific training or paired human-robot data. Our approach is based on two insights. First, the hand motion in a human demonstration provides a useful prior for the robot's end-effector trajectory, which we can convert into a rough open-loop robot motion trajectory via kinematic retargeting. Second, while this retargeted motion captures the overall structure of the task, it may not align well with plausible robot actions in-context. To address this, we leverage a pre-trained generalist diffusion policy to modify the trajectory, ensuring it both follows the human motion and remains within the distribution of plausible robot actions. Unlike approaches based on online reinforcement learning or paired human-robot data, our method enables robust adaptation to new tasks and scenes with minimal effort. In real-world experiments across 8 diverse manipulation tasks, DemoDiffusion achieves 83.8% average success rate, compared to 13.8% for the pre-trained policy and 52.5% for kinematic retargeting, succeeding even on tasks where the pre-trained generalist policy fails entirely. Project page: https://demodiffusion.github.io/

Paper Structure

This paper contains 35 sections, 2 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 2: From retargeted human hand trajectory to closed-loop robot action sequence, for the task $\mathcal{T}$: "shut down the laptop". The dotted line shows the trajectory of robot end-effector poses after kinematic retargeting. The olive contour plot depicts the distribution of trajectories from a pre-trained diffusion policy. Given a kinematically retargeted trajectory, we first perturb it with Gaussian noise and then progressively remove the noise by simulating the reverse SDE with the diffusion policy. This process gradually projects a potentially infeasible but approximately correct retargeting onto the manifold of plausible robot actions that can perform real-world manipulation, in this case closing the laptop without missing the edge.
  • Figure 3: Dexterous Grasping Results in Simulation. (Left) A human demonstration and DemoDiffusion rollout for dexterous grasping. (Right) Success rates (mean $\pm$ std over 3 seeds) as a function of diffusion step $s^*$. Here, $s^*/S=0$ corresponds to kinematic retargeting and $s^*/S=1$ corresponds to the robot policy.
  • Figure 4: Real-World Manipulation Tasks. Human demonstrations for the 8 evaluation tasks, shown as two frames per task. Tasks span prehensile and non-prehensile manipulation, including grasping, pushing, closing, wiping, and placing.
  • Figure 5: Workspace with 5 Cameras. We use the four external cameras for triangulation to obtain the global pose of the hand mesh from a human demonstration. The pre-trained policy uses the two cameras marked in purple.
  • Figure 6: Qualitative comparisons for real-robot manipulation. Rollout progressions (start, intermediate, final frame) for two tasks. Kinematic retargeting (top) produces plausible motion but loses contact before completing the task. Pi-0 (middle) performs general reaching but fails to manipulate the correct object. DemoDiffusion (bottom) reaches the object and maintains contact through task completion. See project page for full videos.
  • ...and 2 more figures
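
The noise-then-denoise procedure described in the Figure 2 caption, together with the role of the diffusion step $s^*$ from Figure 3, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a discrete DDPM-style variance-preserving schedule in place of the reverse SDE, ignores observation conditioning and closed-loop action chunking, and replaces the pre-trained policy with a toy Gaussian noise predictor. All names (`make_schedule`, `gaussian_eps_model`, `refine_trajectory`) and all numeric values are hypothetical.

```python
import numpy as np


def make_schedule(num_steps, beta_min=1e-4, beta_max=0.1):
    """Discrete variance-preserving noise schedule (an assumed choice)."""
    betas = np.linspace(beta_min, beta_max, num_steps)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars


def gaussian_eps_model(mean, sigma0, alpha_bars):
    """Toy stand-in for a pre-trained policy (assumption, not the real model).

    If plausible trajectories follow N(mean, sigma0^2 I), the optimal noise
    prediction at step s has the closed form used below.
    """
    def eps_fn(x, s):
        ab = alpha_bars[s - 1]
        var = ab * sigma0 ** 2 + (1.0 - ab)  # variance of the noised sample
        return np.sqrt(1.0 - ab) * (x - np.sqrt(ab) * mean) / var
    return eps_fn


def refine_trajectory(retargeted, eps_fn, s_star, schedule, rng):
    """Perturb a retargeted trajectory to step s_star, then denoise to step 0.

    s_star = 0 returns the retargeted trajectory unchanged; s_star equal to
    the number of steps starts from (almost) pure noise, so the result is
    dominated by the noise predictor.
    """
    betas, alphas, alpha_bars = schedule
    x = np.array(retargeted, dtype=float)
    if s_star == 0:
        return x
    # Forward perturbation: jump directly to noise level s_star.
    ab = alpha_bars[s_star - 1]
    x = np.sqrt(ab) * x + np.sqrt(1.0 - ab) * rng.standard_normal(x.shape)
    # Reverse process: ancestral sampling from s_star down to 1.
    for s in range(s_star, 0, -1):
        eps_hat = eps_fn(x, s)
        coef = betas[s - 1] / np.sqrt(1.0 - alpha_bars[s - 1])
        x = (x - coef * eps_hat) / np.sqrt(alphas[s - 1])
        if s > 1:  # no noise is added on the final step
            x = x + np.sqrt(betas[s - 1]) * rng.standard_normal(x.shape)
    return x


# Hypothetical example: a straight-line retargeted path that is offset by
# 1.0 in z from the region the toy "policy" considers plausible.
schedule = make_schedule(num_steps=100)
rng = np.random.default_rng(0)
retargeted = np.stack(
    [np.linspace(0.0, 1.0, 20), np.zeros(20), np.zeros(20)], axis=1
)
plausible_mean = retargeted + np.array([0.0, 0.0, 1.0])
eps_fn = gaussian_eps_model(plausible_mean, sigma0=0.05, alpha_bars=schedule[2])

kept = refine_trajectory(retargeted, eps_fn, 0, schedule, rng)      # retargeting only
blended = refine_trajectory(retargeted, eps_fn, 30, schedule, rng)  # intermediate s*
policy = refine_trajectory(retargeted, eps_fn, 100, schedule, rng)  # policy-dominated
```

In this sketch the single parameter `s_star` trades off fidelity to the human-derived trajectory against fidelity to the policy's distribution, matching the two endpoints named in the Figure 3 caption: $s^*/S=0$ for kinematic retargeting and $s^*/S=1$ for the robot policy.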