Learning to Imitate Object Interactions from Internet Videos

Austin Patel, Andrew Wang, Ilija Radosavovic, Jitendra Malik

TL;DR

This work tackles learning to imitate object interactions from unconstrained internet videos by first reconstructing 4D hand-object trajectories (RHOV) from monocular RGB data and then using reinforcement learning to reproduce the trajectories in a physics simulator with a flexible, object-centric policy. RHOV combines hand pose estimation, differentiable rendering for object pose, and temporally coherent optimization with multiple losses to produce smooth, dynamic trajectories. The imitation system treats the recovered trajectory as a dense, 6-DoF goal and trains a robot policy (PPO) to minimize pose and velocity errors, enabling a robotic arm with a parallel jaw gripper to mimic diverse interactions. Across 100 online videos and standard in-lab datasets, the approach shows strong qualitative reconstructions and competitive quantitative results, with ablations highlighting the value of temporal consistency and 2D mask quality for accurate 3D reconstructions and imitations.

Abstract

We study the problem of imitating object interactions from Internet videos. This requires understanding the hand-object interactions in 4D, spatially in 3D and over time, which is challenging due to mutual hand-object occlusions. In this paper we make two main contributions: (1) a novel reconstruction technique RHOV (Reconstructing Hands and Objects from Videos), which reconstructs 4D trajectories of both the hand and the object using 2D image cues and temporal smoothness constraints; (2) a system for imitating object interactions in a physics simulator with reinforcement learning. We apply our reconstruction technique to 100 challenging Internet videos. We further show that we can successfully imitate a range of different object interactions in a physics simulator. Our object-centric approach is not limited to human-like end-effectors and can learn to imitate object interactions using different embodiments, like a robotic arm with a parallel jaw gripper.
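
To make the idea of a temporal smoothness constraint concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the loss used by RHOV: the function name `smoothness_penalty` and the choice of a squared second-difference (acceleration) penalty on 3D positions are hypothetical, chosen only because this is a common way to discourage frame-to-frame jitter in a reconstructed trajectory.

```python
from typing import Sequence


def smoothness_penalty(positions: Sequence[Sequence[float]]) -> float:
    """Sum of squared second differences over a sequence of 3D points.

    Hypothetical illustration: a trajectory moving at constant velocity
    scores 0, while frame-to-frame jitter increases the penalty.
    """
    total = 0.0
    # Second differences need a previous and a next frame, so skip the ends.
    for t in range(1, len(positions) - 1):
        for prev, cur, nxt in zip(positions[t - 1], positions[t], positions[t + 1]):
            accel = nxt - 2.0 * cur + prev  # discrete acceleration along one axis
            total += accel * accel
    return total


# A straight, constant-velocity path versus a path that jumps back and forth.
smooth_path = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
jittery_path = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
```

In an optimization-based reconstruction, a term of this kind would typically be added, with a weight, to the per-frame 2D image losses so that the recovered poses agree with the image evidence while varying smoothly over time.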

Paper Structure (14 sections, 14 equations, 8 figures, 5 tables)

Figures (8)

  • Figure 1: Imitating object interactions from Internet videos. We study the problem of imitating object interactions from Internet videos. We present an approach that is able to imitate a range of different object interactions. Please see our project page at https://austinapatel.github.io/imitate-video for video results.
  • Figure 2: Approach. We present an approach for imitating object interactions from Internet videos. We first reconstruct hand-object trajectories in 4D (see the RHOV section). We then use the recovered trajectories to imitate the object motion with a robot in a physics simulator (see the reinforcement learning section).
  • Figure 3: Reconstructing hands and objects from videos. We present an optimization-based technique for reconstructing hands and objects from videos, leveraging spatial image cues (keypoints, masks, depth) and temporal smoothness constraints (4D, optical flow).
  • Figure 4: RHOV, qualitative results. We show example reconstructions from RHOV. Each 4D reconstruction is shown from six views.
  • Figure 5: Imitating object interactions. We frame the task of imitating the observed object interaction as an RL problem. The recovered 4D object trajectory enables us to compute dense rewards, based on the object pose distance, and train a policy to imitate it.
  • ...and 3 more figures
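
The dense, pose-distance reward described in the Figure 5 caption can be sketched as follows. This is a hedged illustration, not the paper's reward: the function names, the exponential shaping, and the weights `w_pos` and `w_rot` are all assumptions. It also covers only the pose term and omits the velocity error mentioned in the TL;DR. Orientations are assumed to be unit quaternions.

```python
import math
from typing import Sequence


def rotation_angle(q1: Sequence[float], q2: Sequence[float]) -> float:
    """Angle in radians between two unit quaternions (q and -q count as equal)."""
    dot = abs(sum(a * b for a, b in zip(q1, q2)))
    dot = min(1.0, dot)  # guard against rounding slightly above 1
    return 2.0 * math.acos(dot)


def pose_reward(obj_pos, obj_quat, goal_pos, goal_quat,
                w_pos: float = 5.0, w_rot: float = 1.0) -> float:
    """Hypothetical dense reward in (0, 1]: 1 when the object matches the goal pose.

    goal_pos and goal_quat stand for the reconstructed object pose at the
    current time step; w_pos and w_rot are illustrative weights.
    """
    d_pos = math.dist(obj_pos, goal_pos)        # Euclidean position error
    d_rot = rotation_angle(obj_quat, goal_quat)  # orientation error in radians
    return math.exp(-(w_pos * d_pos + w_rot * d_rot))


identity = (1.0, 0.0, 0.0, 0.0)
goal = (0.0, 0.0, 0.5)
r_exact = pose_reward(goal, identity, goal, identity)
r_near = pose_reward((0.0, 0.0, 0.45), identity, goal, identity)
r_far = pose_reward((0.0, 0.0, 0.0), identity, goal, identity)
```

Because the reconstructed trajectory supplies a goal pose at every time step, a reward of this shape gives the policy a learning signal on each frame rather than only at the end of an episode.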