Robot See Robot Do: Imitating Articulated Object Manipulation with Monocular 4D Reconstruction

Justin Kerr, Chung Min Kim, Mingxuan Wu, Brent Yi, Qianqian Wang, Ken Goldberg, Angjoo Kanazawa

TL;DR

Robot See Robot Do (RSRD) enables zero-shot imitation of articulated object manipulation from a single monocular human demonstration. It builds a part-aware model from a static multi-view object scan, then tracks part motion in the demonstration video with differentiable rendering. The core method, 4D Differentiable Part Models (4D-DPM), embeds DINO-based feature fields in a 3D Gaussian Splatting representation and optimizes part trajectories by analysis-by-synthesis, regularized by priors such as monocular depth and as-rigid-as-possible (ARAP) constraints. The recovered part trajectories are used to plan bimanual grasps and arm motions that reproduce the demonstrated object motion, without copying the hand motion and without task-specific training. In experiments on nine objects (10 trials each on a bimanual YuMi robot), each phase of RSRD succeeds 87% of the time on average, and the end-to-end success rate is 60% across 90 trials. The experiments cover object pose registration, grasping, and end-to-end imitation, and they indicate generalization across object orientations and robot morphologies using only pretrained vision features. These results highlight the potential of object-centric, feature-field-based approaches for naturalistic and scalable robot learning from humans.

Abstract

Humans can learn to manipulate new objects by simply watching others; providing robots with the ability to learn from such demonstrations would enable a natural interface for specifying new behaviors. This work develops Robot See Robot Do (RSRD), a method for imitating articulated object manipulation from a single monocular RGB human demonstration given a single static multi-view object scan. We first propose 4D Differentiable Part Models (4D-DPM), a method for recovering 3D part motion from a monocular video with differentiable rendering. This analysis-by-synthesis approach uses part-centric feature fields in an iterative optimization, which enables the use of geometric regularizers to recover 3D motions from only a single video. Given this 4D reconstruction, the robot replicates object trajectories by planning bimanual arm motions that induce the demonstrated object part motion. By representing demonstrations as part-centric trajectories, RSRD focuses on replicating the demonstration's intended behavior while considering the robot's own morphological limits, rather than attempting to reproduce the hand's motion. We evaluate 4D-DPM's 3D tracking accuracy on ground-truth annotated 3D part trajectories and RSRD's physical execution performance on 9 objects across 10 trials each on a bimanual YuMi robot. Each phase of RSRD achieves an average success rate of 87%, for a total end-to-end success rate of 60% across 90 trials. Notably, this is accomplished using only feature fields distilled from large pretrained vision models, without any task-specific training, fine-tuning, dataset collection, or annotation. Project page: https://robot-see-robot-do.github.io
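
To make the analysis-by-synthesis idea concrete, the following is a minimal toy sketch, not the paper's implementation. It assumes a single rigid part in 2D whose model points stand in for the rendered feature fields; the names `synthesize` and `fit_pose`, the learning rate, and the step count are all illustrative assumptions. The loop synthesizes a prediction from the current pose, compares it with the observation, and updates the pose by gradient descent.

```python
import numpy as np

def rotation(theta):
    """2D rotation matrix for an angle in radians."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def synthesize(points, theta, t):
    """Toy 'renderer' (assumption): apply a candidate pose to the model points."""
    return points @ rotation(theta).T + t

def fit_pose(points, observed, steps=300, lr=0.1):
    """Gradient descent on the mean squared error between synthesis and observation."""
    theta, t = 0.0, np.zeros(2)
    for _ in range(steps):
        residual = synthesize(points, theta, t) - observed  # shape (N, 2)
        c, s = np.cos(theta), np.sin(theta)
        d_rot = np.array([[-s, -c], [c, -s]])  # derivative of rotation w.r.t. theta
        grad_theta = 2.0 * np.mean(np.sum(residual * (points @ d_rot.T), axis=1))
        grad_t = 2.0 * residual.mean(axis=0)
        theta -= lr * grad_theta
        t = t - lr * grad_t
    return theta, t

# Hypothetical data: a centered point set observed under an unknown pose.
rng = np.random.default_rng(0)
model_points = rng.uniform(-1.0, 1.0, size=(200, 2))
model_points -= model_points.mean(axis=0)
true_theta, true_t = 0.6, np.array([0.3, -0.2])
observed = synthesize(model_points, true_theta, true_t)

est_theta, est_t = fit_pose(model_points, observed)
print("estimated angle:", est_theta, "estimated translation:", est_t)
```

In the method described above, the quantities being compared are rendered DINO features and depth rather than raw points, and one pose is optimized per part and per timestep. The structure of the loop (synthesize, compare, back-propagate into the pose) is the part this sketch is meant to illustrate.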

Paper Structure

This paper contains 32 sections, 6 figures, and 2 tables.

Figures (6)

  • Figure 1: Robot See Robot Do. To visually imitate articulated object motion, RSRD first reconstructs a part-aware feature field. Given an input demonstration video, we then track the object part motion using the feature field. Next, the robot recognizes the object in its workspace and plans a bimanual trajectory to achieve the demonstrated object motion.
  • Figure 2: 4D Reconstruction of Articulated Objects. Keyframes from the recovered motion trajectories are overlaid on the monocular RGB demonstrations with parts colorized, and are shown alongside two viewpoints.
  • Figure 3: 4D Differentiable Part Models (4D-DPM). Left: DINO features and depth are rendered from per-timestep optimizable part pose parameters and compared with DINO features and monocular depth extracted from the input frame. Right: an ARAP loss penalizes Gaussians for deviating too far from their initial configuration with respect to their neighbors. Gradients from these losses flow back into the part poses, which are optimized with gradient descent to recover 3D part motion.
  • Figure 4: Hand Alignment. RSRD uses HaMeR (Pavlakos et al., 2024) to detect and align human hand poses to the demonstrations. Detections are used to rank part pairs for grasping (Sec. \ref{sec:robot-do}).
  • Figure 5: ARAP Ablation. ARAP is a simple but effective prior for improving 3D motion recovery by preventing small or under-observed parts from drifting.
  • ...and 1 more figure
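
The ARAP prior mentioned in the Figure 3 and Figure 5 captions can be sketched as follows. This is a simplified stand-in and not the paper's exact formulation: it penalizes changes in the distances between each Gaussian center and its nearest neighbors relative to the initial configuration, whereas a full as-rigid-as-possible energy also fits a local rotation per neighborhood. The function names, the neighbor count `k=8`, and the random test points are assumptions made for illustration.

```python
import numpy as np

def nearest_neighbors(points, k):
    """Indices of the k nearest neighbors of every point (brute force)."""
    diff = points[:, None, :] - points[None, :, :]
    dist2 = np.sum(diff ** 2, axis=-1)
    np.fill_diagonal(dist2, np.inf)  # a point is not its own neighbor
    return np.argsort(dist2, axis=1)[:, :k]

def arap_style_loss(initial, current, neighbors):
    """Mean squared change in neighbor distances relative to the initial layout."""
    d_init = np.linalg.norm(initial[:, None, :] - initial[neighbors], axis=-1)
    d_curr = np.linalg.norm(current[:, None, :] - current[neighbors], axis=-1)
    return float(np.mean((d_curr - d_init) ** 2))

# Hypothetical Gaussian centers and their neighbor graph in the initial state.
rng = np.random.default_rng(0)
centers = rng.normal(size=(100, 3))
neighbors = nearest_neighbors(centers, k=8)

# A rigid motion (rotation about z plus translation) preserves all distances.
angle = 0.7
rot_z = np.array([
    [np.cos(angle), -np.sin(angle), 0.0],
    [np.sin(angle),  np.cos(angle), 0.0],
    [0.0,            0.0,           1.0],
])
rigid = centers @ rot_z.T + np.array([0.5, -0.1, 0.2])

# A non-rigid stretch along x changes neighbor distances and is penalized.
stretched = centers * np.array([2.0, 1.0, 1.0])

rigid_loss = arap_style_loss(centers, rigid, neighbors)
stretch_loss = arap_style_loss(centers, stretched, neighbors)
print("rigid:", rigid_loss, "stretched:", stretch_loss)
```

The penalty is near zero under rigid motion and grows when points drift relative to their neighbors. That matches the role the captions describe for ARAP: keeping small or under-observed parts from drifting.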