OKAMI: Teaching Humanoid Robots Manipulation Skills through Single Video Imitation

Jinhan Li; Yifeng Zhu; Yuqi Xie; Zhenyu Jiang; Mingyo Seo; Georgios Pavlakos; Yuke Zhu

TL;DR

The paper addresses open-world imitation from a single RGB-D video as a way to teach humanoid robots dexterous manipulation. It introduces OKAMI, a two-stage framework: the first stage derives a reference plan from the video using open-world vision models and human motion reconstruction, and the second performs object-aware retargeting, adapting body and hand trajectories to new object locations via inverse kinematics (IK). A key contribution is decoupling body motion retargeting from hand pose adaptation, which enables robust generalization across varied object layouts and backgrounds; vision models such as GPT-4V guide object identification. Empirical results show strong generalization on real hardware and simulation, with a 71.7% average task success rate across six tasks. Visuomotor policies trained on OKAMI rollouts reach 79.2% average success, which significantly reduces the need for labor-intensive teleoperation.

Abstract

We study the problem of teaching humanoid robots manipulation skills by imitating single video demonstrations. We introduce OKAMI, a method that generates a manipulation plan from a single RGB-D video and derives a policy for execution. At the heart of our approach is object-aware retargeting, which enables the humanoid robot to mimic the human motions in an RGB-D video while adjusting to different object locations during deployment. OKAMI uses open-world vision models to identify task-relevant objects and retargets the body motions and hand poses separately. Our experiments show that OKAMI achieves strong generalization across varying visual and spatial conditions, outperforming the state-of-the-art baseline on open-world imitation from observation. Furthermore, OKAMI rollout trajectories are leveraged to train closed-loop visuomotor policies, which achieve an average success rate of 79.2% without the need for labor-intensive teleoperation. More videos can be found on our website https://ut-austin-rpl.github.io/OKAMI/.
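
To make the idea of adjusting a demonstrated motion to a new object location concrete, here is a minimal Python sketch. It is an illustrative assumption, not OKAMI's actual retargeting procedure: the function name `warp_trajectory`, the linear blending scheme, and the sample waypoints are all hypothetical. The sketch keeps the first wrist waypoint fixed, moves the last one by the full object displacement, and gives intermediate waypoints a linearly increasing share of it. Solving IK for the warped waypoints and adapting hand poses, both of which the abstract describes, are left out.

```python
from typing import List, Sequence

Point = List[float]


def warp_trajectory(
    ref_traj: Sequence[Sequence[float]],
    ref_obj: Sequence[float],
    new_obj: Sequence[float],
) -> List[Point]:
    """Shift a reference wrist trajectory toward a new object location.

    Hypothetical scheme (not taken from the paper): waypoint i receives the
    fraction i / (n - 1) of the object displacement, so the start of the
    motion is unchanged and the end follows the object.
    """
    n = len(ref_traj)
    if n == 0:
        return []
    # Displacement of the object between demonstration and deployment.
    delta = [b - a for a, b in zip(ref_obj, new_obj)]
    warped: List[Point] = []
    for i, point in enumerate(ref_traj):
        # Blend weight grows from 0 at the first waypoint to 1 at the last.
        weight = i / (n - 1) if n > 1 else 1.0
        warped.append([c + weight * d for c, d in zip(point, delta)])
    return warped


# Made-up waypoints in metres: the demonstrated reach ends at the object.
ref_traj = [[0.0, 0.0, 0.0], [0.1, 0.0, 0.1], [0.2, 0.0, 0.0]]
warped = warp_trajectory(
    ref_traj,
    ref_obj=[0.2, 0.0, 0.0],
    new_obj=[0.2, 0.1, 0.0],
)
```

In this example the object has moved 0.1 m along the y axis, so the final waypoint moves with it while the starting waypoint stays where the demonstration began. A real system would then pass each warped waypoint to an IK solver to obtain joint targets.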
