3PoinTr: 3D Point Tracks for Robot Manipulation Pretraining from Casual Videos

Adam Hung, Bardienus Pieter Duisterhof, Jeffrey Ichnowski

TL;DR

3PoinTr pretrains robot policies from casual, unconstrained human videos, enabling learning from motions that are natural for humans. It achieves robust spatial generalization on diverse categories of manipulation tasks with only 20 action-labeled robot demonstrations.

Abstract

Data-efficient training of robust robot policies is the key to unlocking automation in a wide array of novel tasks. Current systems require large volumes of demonstrations to achieve robustness, which is impractical in many applications. Learning policies directly from human videos is a promising alternative that removes teleoperation costs, but it shifts the challenge toward overcoming the embodiment gap (differences in kinematics and strategies between robots and humans), often requiring restrictive and carefully choreographed human motions. We propose 3PoinTr, a method for pretraining robot policies from casual and unconstrained human videos, enabling learning from motions natural for humans. 3PoinTr uses a transformer architecture to predict 3D point tracks as an intermediate embodiment-agnostic representation. 3D point tracks encode goal specifications, scene geometry, and spatiotemporal relationships. We use a Perceiver IO architecture to extract a compact representation for sample-efficient behavior cloning, even when point tracks violate downstream embodiment-specific constraints. We conduct thorough evaluation on simulated and real-world tasks, and find that 3PoinTr achieves robust spatial generalization on diverse categories of manipulation tasks with only 20 action-labeled robot demonstrations. 3PoinTr outperforms the baselines, including behavior cloning methods, as well as prior methods for pretraining from human videos. We also provide evaluations of 3PoinTr's 3D point track predictions compared to an existing point track prediction baseline. We find that 3PoinTr produces more accurate and higher quality point tracks due to a lightweight yet expressive architecture built on a single transformer, in addition to a training formulation that preserves supervision of partially occluded points. Project page: https://adamhung60.github.io/3PoinTr/.

Paper Structure

This paper contains 28 sections, 4 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: 3PoinTr is a general and scalable method for pretraining manipulation policies with casual human videos. Given an observed point cloud, 3PoinTr answers: how will the scene evolve when completing the task? We contribute a state-of-the-art 3D point track prediction transformer, and use a Perceiver IO architecture and Diffusion Policy to enable state-of-the-art imitation learning.
  • Figure 2: Diagram of the 3PoinTr network architecture. We first encode an initial point cloud, pass through a transformer decoder, and project each point token to a 3D point trajectory. This yields dense 3D point tracks that encode goal specifications, scene geometry, and spatiotemporal relationships. We then aggregate the per-point trajectory features using a Perceiver IO-style cross-attention module. A small set of learned query tokens attends to the full set of point track tokens, producing a compact global representation of the task. This representation is used as conditioning input to a Diffusion Policy, which generates an open-loop sequence of robot actions.
  • Figure 3: Visualizations of the simulation and real-world tasks on which we evaluate 3PoinTr. The upper-left section shows renderings of simulation environments, where robot trajectories are procedurally generated, and the resulting data is used for both video pretraining and behavior cloning. The right section shows images from real-world data collection, where video pretraining trains on casual human demonstration videos, and behavior cloning trains on teleoperated robot demonstration data. We display 2D projections of the 3D point tracks extracted from the videos in green, representing the training targets for our point track prediction networks. The bottom-left section shows a few examples of the random initial configurations we use for evaluation, demonstrating the spatial generalization required for policies to succeed.
  • Figure 4: Simulation task success rate vs. number of robot demonstrations. The plot shows how success rate changes when we vary the number of action-labeled robot demonstrations. Results are averaged across the three simulation tasks.
  • Figure 5: We visualize a few examples of how we learn from casual videos. The left column shows the human and object motion in the casual human videos, while the right column shows the robot and object motion in the robot demonstrations. 3PoinTr first learns to predict object motions from human videos (red arrows on the left), and then learns to map those motions to robot actions (blue arrows on the right). As shown in these examples, neither the embodiment motions nor the object motions need to closely match between the two datasets.
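
The aggregation step described in the Figure 2 caption, where a small set of learned query tokens attends to all point track tokens, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `latent_cross_attention`, the single-head attention, the linear track embedding, and all sizes (512 points, 16 timesteps, 64-dimensional tokens, 8 latent queries) are assumptions chosen for illustration, and random matrices stand in for learned weights.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def latent_cross_attention(point_tokens, latent_queries, w_k, w_v):
    """Single-head cross-attention from M latent queries to N point tokens.

    point_tokens:   (N, D) one token per tracked point
    latent_queries: (M, D) small set of query tokens (learned in practice)
    Returns the (M, D) summary and the (M, N) attention weights.
    """
    keys = point_tokens @ w_k
    values = point_tokens @ w_v
    scores = latent_queries @ keys.T / np.sqrt(keys.shape[-1])
    weights = softmax(scores, axis=-1)  # each query's weights sum to 1
    return weights @ values, weights


# Hypothetical sizes; none of these come from the paper.
num_points, horizon, dim, num_latents = 512, 16, 64, 8
rng = np.random.default_rng(0)

# Dense 3D point tracks: one (horizon, 3) trajectory per point.
tracks = rng.normal(size=(num_points, horizon, 3))

# Assumed embedding: flatten each trajectory and project it to a token.
w_embed = rng.normal(size=(horizon * 3, dim)) / np.sqrt(horizon * 3)
point_tokens = tracks.reshape(num_points, -1) @ w_embed

# Random stand-ins for parameters that would be learned.
latent_queries = rng.normal(size=(num_latents, dim))
w_k = rng.normal(size=(dim, dim)) / np.sqrt(dim)
w_v = rng.normal(size=(dim, dim)) / np.sqrt(dim)

summary, weights = latent_cross_attention(point_tokens, latent_queries, w_k, w_v)

# Flattened compact representation that a downstream policy could condition on.
conditioning = summary.reshape(-1)
```

The size of the output depends only on the number of latent queries and the token width, not on the number of tracked points, which is what makes this kind of pooling a compact, fixed-size conditioning input.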