Point Policy: Unifying Observations and Actions with Key Points for Robot Manipulation

Siddhant Haldar, Lerrel Pinto

TL;DR

Point Policy introduces a point-based representation to learn robot manipulation policies exclusively from offline human demonstration videos, eliminating the need for robot teleoperation data. By extracting 3D hand and object key points via two-view triangulation and semantic correspondence, a transformer-based policy predicts future robot key points, which are back-mapped to 6-DOF end-effector actions using rigid-body geometry. The approach achieves strong in-domain performance, robust generalization to novel object instances, and resilience to background clutter across eight real-world tasks, outperforming baselines by large margins. The work highlights the viability of leveraging vision-model priors for cross-morphology policy learning and sets the stage for further improvements via depth sensing and object priors, while acknowledging limitations in vision system reliability and scene context retention.
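The two-view triangulation step mentioned above can be illustrated with a minimal sketch. It assumes calibrated cameras with known 3x4 projection matrices and uses the standard linear (DLT) formulation; the summary does not state which triangulation method the authors use. The camera intrinsics, the 0.2 m baseline, and the function names `triangulate_point` and `project` are hypothetical, chosen only for illustration.

```python
import numpy as np


def triangulate_point(P1, P2, uv1, uv2):
    """Linear (DLT) triangulation of one 3D point from two pixel observations.

    P1, P2: 3x4 camera projection matrices (assumed known from calibration).
    uv1, uv2: (u, v) pixel coordinates of the same key point in each view.
    """
    u1, v1 = uv1
    u2, v2 = uv2
    # Each view contributes two linear constraints on the homogeneous point X.
    A = np.stack([
        u1 * P1[2] - P1[0],
        v1 * P1[2] - P1[1],
        u2 * P2[2] - P2[0],
        v2 * P2[2] - P2[1],
    ])
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]


def project(P, X):
    """Project a 3D point to pixel coordinates with projection matrix P."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]


# Hypothetical stereo rig: shared intrinsics, second camera shifted 0.2 m along x.
K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-0.2], [0.0], [0.0]])])

X_true = np.array([0.1, -0.05, 1.0])  # synthetic key point, in meters
X_est = triangulate_point(P1, P2, project(P1, X_true), project(P2, X_true))
```

With noise-free observations the estimate matches the synthetic point up to numerical precision. With real detections, a nonlinear refinement of the reprojection error would typically follow.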

Abstract

Building robotic agents capable of operating across diverse environments and object types remains a significant challenge, often requiring extensive data collection. This is particularly restrictive in robotics, where each data point must be physically executed in the real world. Consequently, there is a critical need for alternative data sources for robotics and frameworks that enable learning from such data. In this work, we present Point Policy, a new method for learning robot policies exclusively from offline human demonstration videos and without any teleoperation data. Point Policy leverages state-of-the-art vision models and policy architectures to translate human hand poses into robot poses while capturing object states through semantically meaningful key points. This approach yields a morphology-agnostic representation that facilitates effective policy learning. Our experiments on 8 real-world tasks demonstrate an overall 75% absolute improvement over prior works when evaluated in settings identical to those used in training. Further, Point Policy exhibits a 74% gain across tasks for novel object instances and is robust to significant background clutter. Videos of the robot are best viewed at https://point-policy.github.io/.

Paper Structure

This paper contains 52 sections, 6 equations, 9 figures, 10 tables.

Figures (9)

  • Figure 1: We present Point Policy, a framework that unifies robot observations and actions with key points and enables learning robot policies exclusively from human videos. Point Policy enables learning policies with improved generalization capabilities, including spatial generalization (i.e. generalization to new locations), generalization to novel object instances, and robustness to background distractors.
  • Figure 2: Overview of the Point Policy framework. (a) Point Policy leverages state-of-the-art vision models and policy architectures to translate human hand poses into robot poses while capturing object states through sparse single-frame human annotations. (b) The derived key points are fed into a transformer policy that predicts future 3D point tracks, from which the robot actions are computed through rigid-body geometry constraints. (c) Finally, the computed action is executed on the robot using end-effector position control at 6 Hz.
  • Figure 3: Results of the correspondence model when used for the put bottle on rack and sweep broom tasks. On the left is a frame with human annotations for the object points. On the right, we show that semantic correspondence can identify the same points across different positions, new object instances, and background clutter.
  • Figure 4: (left) Illustration of the spatial variation used in our experiments. (right) Range of objects used in our experiments: the objects on the left are in-domain, while those on the right are unseen objects used in our generalization experiments.
  • Figure 5: Real-world rollouts showing Point Policy's ability on in-domain objects across 8 real-world tasks.
  • ...and 4 more figures
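
Figure 2(b) states that robot actions are computed from the predicted 3D points through rigid-body geometry constraints. One standard way to recover a 6-DOF pose from corresponding point sets is the Kabsch least-squares alignment, sketched below. This is an illustrative stand-in, not necessarily the authors' procedure; the key-point layout `gripper_points` and the function name `rigid_transform_from_points` are hypothetical.

```python
import numpy as np


def rigid_transform_from_points(src, dst):
    """Least-squares rigid transform (Kabsch) such that dst ~= src @ R.T + t.

    src, dst: (N, 3) arrays of corresponding 3D points, N >= 3, not collinear.
    Returns a 3x3 rotation matrix R and a translation vector t.
    """
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    # Cross-covariance of the centered point sets.
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t


# Hypothetical key points attached rigidly to the end effector (its own frame, meters).
gripper_points = np.array([[0.0, 0.0, 0.0],
                           [0.05, 0.0, 0.0],
                           [0.0, 0.05, 0.0],
                           [0.0, 0.0, 0.05]])

# Synthetic "predicted" points: the same layout rotated 30 degrees about z and translated.
theta = np.deg2rad(30.0)
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta), np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
t_true = np.array([0.4, -0.1, 0.3])
predicted_points = gripper_points @ R_true.T + t_true

R_est, t_est = rigid_transform_from_points(gripper_points, predicted_points)
```

Here `R_est` and `t_est` describe the end-effector pose implied by the predicted points. When the predictions are noisy and not exactly rigid, the same routine returns the closest rigid fit in the least-squares sense.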