Imitating What Works: Simulation-Filtered Modular Policy Learning from Human Videos

Albert J. Zhai; Kuo-Hao Zeng; Jiasen Lu; Ali Farhadi; Shenlong Wang; Wei-Chiu Ma

TL;DR

This work tackles learning robotic manipulation from human videos, focusing on prehensile tasks that combine grasping with post-grasp motions. It introduces Perceive-Simulate-Imitate (PSI), a three-step framework that (1) perceives 6-DoF object poses from videos, (2) simulates and filters grasp-trajectory pairs to label task-compatibility, and (3) imitates via an open-loop visuomotor policy trained with staged losses. PSI enables task-oriented grasping without any robot demonstrations, achieving robust real-world performance and improving sample efficiency, especially when pretrained on HOI4D data. The approach generalizes across multiple robot embodiments and supports integration with existing grasp generators, offering a scalable path toward cross-embodiment manipulation learning with minimal robot data.

Abstract

The ability to learn manipulation skills by watching videos of humans has the potential to unlock a new source of highly scalable data for robot learning. Here, we tackle prehensile manipulation, in which tasks involve grasping an object before performing various post-grasp motions. Human videos offer strong signals for learning the post-grasp motions, but they are less useful for learning the prerequisite grasping behaviors, especially for robots without human-like hands. A promising way forward is to use a modular policy design, leveraging a dedicated grasp generator to produce stable grasps. However, arbitrary stable grasps are often not task-compatible, hindering the robot's ability to perform the desired downstream motion. To address this challenge, we present Perceive-Simulate-Imitate (PSI), a framework for training a modular manipulation policy using human video motion data processed by paired grasp-trajectory filtering in simulation. This simulation step extends the trajectory data with grasp suitability labels, which allows for supervised learning of task-oriented grasping capabilities. We show through real-world experiments that our framework can be used to learn precise manipulation skills efficiently without any robot data, resulting in significantly more robust performance than using a grasp generator naively.

Paper Structure (29 sections, 8 figures, 8 tables)

This paper contains 29 sections, 8 figures, 8 tables.

Introduction
Related Work
Learning manipulation skills from human videos
Simulation-based filtering for robot learning
Task-oriented grasping
Method
Overview
Object 6-DoF Pose as Motion Representation
6-DoF Pose vs. Flow
6-DoF Pose Estimation
Trajectory and Grasp Filtering via Simulation
Policy Learning
Policy Execution
Experiments
Experiment Setup
...and 14 more sections

Figures (8)

Figure 1: Modular prehensile imitation learning. Human videos are well-suited for learning post-grasp motions but are not suitable for learning grasping for non-anthropomorphic end-effectors. Separating these subtasks via a modular policy design allows for dedicated post-grasp learning. However, existing methods under this paradigm fail to acquire task-compatible grasping skills and are not robust to poor-quality motion data. We propose a simple but effective solution to these issues using simulation-based filtering and a learned grasp scoring model.
Figure 2: Task-compatibility for grasps. Even though a grasp may be stable, it may not be compatible with the downstream task. With a firm right-hand underhand grip on the door handle (right), it becomes very difficult to turn the handle clockwise. Task-agnostic grasp generators fall short in solving this problem, highlighting the need for task-oriented grasping.
Figure 3: Overview of our framework. PSI is a three-step framework for visual imitation learning using only RGB-D videos of human demonstrations. In the Perceive step, we use 3D vision techniques to track the 6D pose of active objects in videos (Sec. \ref{sec:pstep}). The 6D pose trajectories can be directly translated to end-effector actions on a robot. In the Simulate step, we leverage simulation to refine and enhance the pose trajectory data with grasp suitability labels (Sec. \ref{sec:sstep}). In the Imitate step, we train an open-loop visuomotor policy on the data via behavior cloning (Sec. \ref{sec:istep}). The policy can be combined with any existing grasp generator to perform task-oriented grasping and manipulation on a real robot (Sec. \ref{sec:execution}).
Figure 4: Modular task-oriented grasping. PSI exploits existing models for grasp stability, while achieving task-compatibility via a scoring model trained on simulation data. The scoring model is run first to produce scores for a set of canonical anchor grasps. It can then be combined with any grasp generator in a modular manner by assigning candidate grasps to their nearest anchor grasps.
Figure 5: Qualitative 6D pose tracking results. We experiment with a model-based method (FoundationPose) and a model-free method (ICP + Refinement) for 6D pose tracking. Each RGB image shows the scene at the end of a trajectory. Object bounding boxes transformed using the tracked 6D pose from 8 timesteps are overlaid onto the image. We observe that FoundationPose provides slightly more accurate tracking, but the ICP pipeline still performs satisfactorily for task success.
...and 3 more figures
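
The nearest-anchor assignment described in the Figure 4 caption can be illustrated with a short sketch. This is not the authors' implementation. The grasp representation (a position plus a unit approach direction), the distance function `grasp_distance`, the weight `rot_weight`, and the helper `select_grasp` are all hypothetical choices made for illustration. The sketch assumes the anchor scores have already been produced by some scoring model.

```python
import math

def grasp_distance(a, b, rot_weight=0.1):
    """Hypothetical distance between two grasps.

    Each grasp is a (position, approach_dir) pair: position in meters,
    approach_dir a unit 3-vector. rot_weight (meters per radian) is an
    assumed trade-off between translation and rotation error.
    """
    (pos_a, dir_a), (pos_b, dir_b) = a, b
    translation = math.dist(pos_a, pos_b)
    cos_angle = sum(x * y for x, y in zip(dir_a, dir_b))
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding
    return translation + rot_weight * math.acos(cos_angle)

def select_grasp(candidates, anchors, anchor_scores):
    """Give each candidate the score of its nearest anchor; return the best."""
    best, best_score = None, -math.inf
    for cand in candidates:
        nearest = min(range(len(anchors)),
                      key=lambda i: grasp_distance(cand, anchors[i]))
        if anchor_scores[nearest] > best_score:
            best, best_score = cand, anchor_scores[nearest]
    return best, best_score

# Two assumed anchor grasps: a top-down grasp and a side grasp.
anchors = [((0.0, 0.0, 0.1), (0.0, 0.0, -1.0)),
           ((0.1, 0.0, 0.0), (-1.0, 0.0, 0.0))]
anchor_scores = [0.2, 0.9]  # placeholder task-compatibility scores

# Candidates as they might come from any stable-grasp generator.
candidates = [((0.01, 0.0, 0.09), (0.0, 0.0, -1.0)),
              ((0.09, 0.01, 0.0), (-1.0, 0.0, 0.0))]

chosen, score = select_grasp(candidates, anchors, anchor_scores)
```

In this toy setup the side-grasp candidate is selected, because its nearest anchor carries the higher score. Only the anchor scores depend on the task, so the grasp generator can be swapped without retraining the scoring model, which is the modularity the caption emphasizes.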
