Learning to Imitate Object Interactions from Internet Videos
Austin Patel, Andrew Wang, Ilija Radosavovic, Jitendra Malik
TL;DR
This work tackles learning to imitate object interactions from unconstrained Internet videos in two stages. First, RHOV reconstructs 4D hand-object trajectories from monocular RGB video; second, reinforcement learning reproduces those trajectories in a physics simulator with a flexible, object-centric policy. RHOV combines hand pose estimation, differentiable rendering for object pose, and temporally coherent optimization with multiple losses to produce smooth, dynamic trajectories. The imitation system treats the recovered trajectory as a dense 6-DoF goal and trains a robot policy with PPO to minimize pose and velocity errors, enabling a robotic arm with a parallel jaw gripper to mimic diverse interactions. Across 100 online videos and standard in-lab datasets, the approach shows strong qualitative reconstructions and competitive quantitative results, with ablations highlighting the value of temporal consistency and 2D mask quality for accurate 3D reconstructions and imitations.
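To make the idea of a dense 6-DoF goal concrete, here is a minimal sketch of a per-timestep tracking reward that penalizes position, orientation, and velocity errors between the simulated object and the reconstructed trajectory. The function names, the exponential form, and the weights `w_pos`, `w_rot`, and `w_vel` are illustrative assumptions, not the paper's actual reward.

```python
import math

def quat_angle(q1, q2):
    """Smallest rotation angle (radians) between two unit quaternions (w, x, y, z)."""
    # |dot| handles the double cover: q and -q describe the same rotation.
    dot = abs(sum(a * b for a, b in zip(q1, q2)))
    dot = min(1.0, dot)  # guard against rounding slightly above 1
    return 2.0 * math.acos(dot)

def tracking_reward(obj_pos, obj_quat, obj_vel,
                    goal_pos, goal_quat, goal_vel,
                    w_pos=1.0, w_rot=0.5, w_vel=0.1):
    """Hypothetical reward in (0, 1]; equals 1 when the object matches the goal.

    The weights and the exponential shaping are assumptions for illustration.
    """
    pos_err = math.dist(obj_pos, goal_pos)      # translation error
    rot_err = quat_angle(obj_quat, goal_quat)   # orientation error in radians
    vel_err = math.dist(obj_vel, goal_vel)      # linear velocity error
    return math.exp(-(w_pos * pos_err + w_rot * rot_err + w_vel * vel_err))
```

A policy trained with PPO would receive a reward of this kind at every simulation step, with the goal pose and velocity read from the reconstructed trajectory at the matching time index.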
Abstract
We study the problem of imitating object interactions from Internet videos. This requires understanding the hand-object interactions in 4D, spatially in 3D and over time, which is challenging due to mutual hand-object occlusions. In this paper we make two main contributions: (1) a novel reconstruction technique RHOV (Reconstructing Hands and Objects from Videos), which reconstructs 4D trajectories of both the hand and the object using 2D image cues and temporal smoothness constraints; (2) a system for imitating object interactions in a physics simulator with reinforcement learning. We apply our reconstruction technique to 100 challenging Internet videos. We further show that we can successfully imitate a range of different object interactions in a physics simulator. Our object-centric approach is not limited to human-like end-effectors and can learn to imitate object interactions using different embodiments, like a robotic arm with a parallel jaw gripper.
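As a rough illustration of what a temporal smoothness constraint can look like, the sketch below penalizes the squared second difference (a discrete acceleration) of a sequence of 3D object positions. The abstract does not specify RHOV's smoothness term, so this particular form and the name `smoothness_loss` are assumptions.

```python
def smoothness_loss(positions):
    """Mean squared second difference over a sequence of 3D points.

    Assumed form for illustration: zero for constant-velocity motion,
    growing as the trajectory jitters from frame to frame.
    """
    if len(positions) < 3:
        return 0.0  # a second difference needs at least three frames
    total = 0.0
    for prev, cur, nxt in zip(positions, positions[1:], positions[2:]):
        # Discrete acceleration per axis: x[t-1] - 2*x[t] + x[t+1]
        total += sum((p - 2 * c + n) ** 2 for p, c, n in zip(prev, cur, nxt))
    return total / (len(positions) - 2)
```

In an optimization-based reconstruction, a term like this would be added to the 2D image losses so that per-frame pose estimates are pulled toward a coherent trajectory.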
