OKAMI: Teaching Humanoid Robots Manipulation Skills through Single Video Imitation

Jinhan Li, Yifeng Zhu, Yuqi Xie, Zhenyu Jiang, Mingyo Seo, Georgios Pavlakos, Yuke Zhu

TL;DR

The paper addresses open-world imitation from a single RGB-D video as a way to teach humanoid robots dexterous manipulation. It introduces OKAMI, a two-stage framework: the first stage derives a reference plan from the video using open-world vision and motion reconstruction, and the second performs object-aware retargeting, adapting body and hand trajectories to new object locations via inverse kinematics (IK). A key contribution is decoupling body motion retargeting from hand pose adaptation, which supports generalization across varied object layouts and backgrounds; vision models such as GPT-4V guide object identification. Empirical results show strong generalization on real hardware and simulation, with a 71.7% average task success rate across six tasks. Visuomotor policies trained on OKAMI rollouts reach 79.2% average success, significantly reducing the need for labor-intensive teleoperation.

Abstract

We study the problem of teaching humanoid robots manipulation skills by imitating from single video demonstrations. We introduce OKAMI, a method that generates a manipulation plan from a single RGB-D video and derives a policy for execution. At the heart of our approach is object-aware retargeting, which enables the humanoid robot to mimic the human motions in an RGB-D video while adjusting to different object locations during deployment. OKAMI uses open-world vision models to identify task-relevant objects and retarget the body motions and hand poses separately. Our experiments show that OKAMI achieves strong generalizations across varying visual and spatial conditions, outperforming the state-of-the-art baseline on open-world imitation from observation. Furthermore, OKAMI rollout trajectories are leveraged to train closed-loop visuomotor policies, which achieve an average success rate of 79.2% without the need for labor-intensive teleoperation. More videos can be found on our website https://ut-austin-rpl.github.io/OKAMI/.
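
To make the idea of object-aware retargeting concrete, here is a minimal sketch of how a demonstrated wrist trajectory might be adjusted when the target object sits somewhere else at deployment. It is an illustration under stated assumptions, not the authors' implementation: the function name `warp_trajectory`, the linear blending scheme, and the use of plain 3D points are all hypothetical choices made for this example. In a full system, the warped waypoints would then be passed to an IK solver to obtain joint commands.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]


def warp_trajectory(
    demo_waypoints: List[Point],
    demo_object: Point,
    new_object: Point,
) -> List[Point]:
    """Warp a demonstrated trajectory toward a new object location.

    Assumption (not from the paper): the object displacement is blended in
    linearly along the trajectory, so the first waypoint is unchanged and
    the last waypoint is shifted by the full displacement.
    """
    if not demo_waypoints:
        return []

    # Displacement of the object between demonstration and deployment.
    offset = tuple(n - d for n, d in zip(new_object, demo_object))

    last = len(demo_waypoints) - 1
    warped: List[Point] = []
    for i, waypoint in enumerate(demo_waypoints):
        # Blend factor grows from 0 at the start to 1 at the end.
        alpha = i / last if last > 0 else 1.0
        warped.append(tuple(c + alpha * o for c, o in zip(waypoint, offset)))
    return warped


if __name__ == "__main__":
    demo = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.1), (0.2, 0.0, 0.2)]
    for point in warp_trajectory(demo, (0.2, 0.0, 0.2), (0.3, 0.1, 0.2)):
        print(point)
```

The linear blend keeps the start of the motion close to the demonstration while guaranteeing that the end of the motion lands at the relocated object. Hand poses are deliberately left out of the warp, mirroring the separation of body motion and hand pose retargeting described above.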

Paper Structure

This paper contains 23 sections, 2 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: OKAMI enables a human user to teach the humanoid robot how to perform a new task by providing a single video demonstration.
  • Figure 2: Overview of OKAMI. OKAMI is a two-staged method that enables a humanoid robot to imitate a manipulation task from a single human video. In the first stage, OKAMI generates a reference plan using GPT-4V and large vision models for subsequent manipulation. In the second stage, OKAMI follows the reference plan, where it retargets human motions onto the humanoid with object awareness. The retargeted motions are converted into a sequence of robot joint commands for the robot to follow.
  • Figure 3: Visualization of initial and final frames of both human demonstrations and robot rollouts for all tasks.
  • Figure 4: (a) Evaluation of OKAMI over all six tasks, including the success rates and the quantification of failed trials, separated by failure mode. (b) Evaluation of OKAMI using videos from different demonstrators. Demonstrator 1 is the main person recording videos for all evaluations in (a).
  • Figure 5: Success rates of learned visuomotor policies on Sprinkle-salt and Bagging using 50 and 100 trajectories, respectively.
  • ...and 2 more figures