TAX-Pose: Task-Specific Cross-Pose Estimation for Robot Manipulation

Chuer Pan, Brian Okorn, Harry Zhang, Ben Eisner, David Held

TL;DR

The paper aims to let robots manipulate unseen objects by estimating the cross-pose, a task-specific relative pose between two interacting objects, defined as $\mathbf{T}_{\mathcal{A}\mathcal{B}} = \mathbf{T}_{\mathcal{B}} \mathbf{T}_{\mathcal{A}}^{-1}$ in $SE(3)$. It introduces TAX-Pose, a vision-based system that predicts dense, soft cross-object correspondences using two DGCNN encoders and a cross-object attention transformer, then resolves them into a single cross-pose through a differentiable weighted SVD on the corrected correspondences. The method generalizes to novel objects and configurations on the NDF mug-hanging and PartNet-Mobility placement tasks, in some cases after training on only 10 real-world demonstrations, and transfers to the real world without finetuning. The estimated cross-pose is passed to a downstream motion planner, which moves the objects into the desired relative placement. Stated limitations include the need for segmentation and sensitivity to occlusion. Overall, the work offers a data-efficient, translation-equivariant approach to geometric reasoning for manipulation that transfers skills across object instances within a category.
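
The definition above can be made concrete with a small numerical sketch. It assumes, as an interpretation that the summary does not spell out, that $\mathbf{T}_{\mathcal{A}}$ and $\mathbf{T}_{\mathcal{B}}$ are the rigid transforms that displace the action object and the anchor object from a goal configuration. The helper names `make_pose` and `cross_pose` and all numeric values are hypothetical and are not taken from the paper's code.

```python
import numpy as np

def make_pose(yaw_deg, translation):
    """Build a 4x4 homogeneous transform: rotation about z plus a translation."""
    theta = np.deg2rad(yaw_deg)
    c, s = np.cos(theta), np.sin(theta)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T

def cross_pose(T_A, T_B):
    """Cross-pose T_AB = T_B @ inv(T_A), following the definition quoted above."""
    return T_B @ np.linalg.inv(T_A)

# Hypothetical displacements of the action object (A) and the anchor (B)
# away from a goal configuration (assumed interpretation, see lead-in).
T_A = make_pose(30.0, [0.1, 0.0, 0.2])
T_B = make_pose(-45.0, [0.5, 0.3, 0.0])
T_AB = cross_pose(T_A, T_B)

# Applying the cross-pose to the displaced action object gives it the same
# displacement as the anchor, so the pair is back in the goal relationship.
moved_A = T_AB @ T_A

# If both objects share one transform, the cross-pose is the identity:
# nothing needs to move, because the relative pose is already the goal.
T_shared = make_pose(90.0, [1.0, 2.0, 3.0])
T_identity = cross_pose(T_shared, T_shared)
```

Under this reading, `moved_A` equals `T_B`, and `T_identity` is the 4x4 identity matrix.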

Abstract

How do we imbue robots with the ability to efficiently manipulate unseen objects and transfer relevant skills based on demonstrations? End-to-end learning methods often fail to generalize to novel objects or unseen configurations. Instead, we focus on the task-specific pose relationship between relevant parts of interacting objects. We conjecture that this relationship is a generalizable notion of a manipulation task that can transfer to new objects in the same category; examples include the relationship between the pose of a pan relative to an oven or the pose of a mug relative to a mug rack. We call this task-specific pose relationship "cross-pose" and provide a mathematical definition of this concept. We propose a vision-based system that learns to estimate the cross-pose between two objects for a given manipulation task using learned cross-object correspondences. The estimated cross-pose is then used to guide a downstream motion planner to manipulate the objects into the desired pose relationship (placing a pan into the oven or the mug onto the mug rack). We demonstrate our method's capability to generalize to unseen objects, in some cases after training on only 10 demonstrations in the real world. Results show that our system achieves state-of-the-art performance in both simulated and real-world experiments across a number of tasks. Supplementary information and videos can be found at https://sites.google.com/view/tax-pose/home.

Paper Structure

This paper contains 44 sections, 40 equations, 14 figures, 14 tables.

Figures (14)

  • Figure 1: To solve a relative placement task, TAX-Pose uses cross-object attention to estimate dense cross-object correspondences and importance weights for each object point. This dense estimate is mapped to a single "cross-pose" which the robot uses to accomplish the given task.
  • Figure 2: We study relative placement tasks, in which one object needs to be placed in a position relative to another object. Here are two of the tasks on which we demonstrate our method. Top: the PartNet-Mobility Placement Task requires one object (e.g. a block) to be placed relative to another object (e.g. a drawer) at a semantic goal position (e.g. inside). Bottom: the Mug Hanging Task requires placing the mug's handle on the mug rack.
  • Figure 3: If we transform both the action object (mug) and the anchor object (rack) by the same transform, then the relative pose between these objects is unchanged (the mug is still "on" the rack) so the mug is still in the goal configuration.
  • Figure 4: TAX-Pose Training Overview: Given a specific task, our method takes as input two point clouds and outputs the cross-pose between them needed to achieve the task. TAX-Pose first learns point cloud features using two DGCNN [phan2018dgcnn] networks and two Transformers [vaswani2017attention]. The learned features are then each input to a point residual network to predict per-point soft correspondences and weights across the two objects. The desired cross-pose can be inferred analytically from these correspondences using singular value decomposition.
  • Figure 5: Real-world experiments summary. Left: In the object placement task, we train using simulated demonstrations and test on real-world objects. Right: Mug Hanging real-world experiments. We train on just 10 real-world demonstrations from 10 training mugs and test on 10 unseen test mugs.
  • ...and 9 more figures
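
The last step described in the Figure 4 caption, recovering a single rigid transform from weighted point correspondences with a singular value decomposition, can be sketched as follows. This is a minimal NumPy illustration of the standard weighted least-squares (Kabsch-style) solution, not the authors' implementation: the function name `weighted_svd_pose` and the synthetic data are assumptions, and a plain NumPy version is not differentiable, unlike the layer the summary describes.

```python
import numpy as np

def weighted_svd_pose(src, tgt, weights):
    """Rigid (R, t) minimizing sum_i w_i * ||R @ src_i + t - tgt_i||^2.

    src, tgt: (N, 3) arrays of corresponding points; weights: (N,) positive.
    """
    w = weights / weights.sum()              # normalize importance weights
    src_mean = w @ src                       # weighted centroids
    tgt_mean = w @ tgt
    src_c = src - src_mean
    tgt_c = tgt - tgt_mean
    H = (src_c * w[:, None]).T @ tgt_c       # 3x3 weighted cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a proper rotation (det = +1).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = tgt_mean - R @ src_mean
    return R, t

# Synthetic check: transform random points by a known pose and recover it.
rng = np.random.default_rng(0)
src = rng.normal(size=(50, 3))
theta = np.deg2rad(40.0)
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([0.3, -0.1, 0.5])
tgt = src @ R_true.T + t_true
weights = rng.uniform(0.1, 1.0, size=50)     # stand-in for predicted weights

R_est, t_est = weighted_svd_pose(src, tgt, weights)
```

With noise-free correspondences the recovered `R_est` and `t_est` match `R_true` and `t_true` up to floating-point error, whatever the positive weights are. The weights only matter when correspondences are imperfect, which is the case the predicted per-point weights are meant to handle.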