kPAM: KeyPoint Affordances for Category-Level Robotic Manipulation

Lucas Manuelli, Wei Gao, Peter Florence, Russ Tedrake

TL;DR

This work tackles category-level robotic manipulation where object instances vary widely in shape and topology. It introduces kPAM, which uses semantic 3D keypoints as a task-aware object representation, enabling manipulation targets to be expressed as geometric costs and constraints on keypoints. The pipeline factors perception and action into instance segmentation, 3D keypoint detection, optimization-based action planning, and dense-geometry-based execution, enabling robust generalization to never-before-seen objects. Hardware experiments with shoes and mugs demonstrate centimeter-level precision and successful category-level manipulation, highlighting the approach's interpretability and practicality for real-world robotics.

Abstract

We would like robots to achieve purposeful manipulation by placing any instance from a category of objects into a desired set of goal states. Existing manipulation pipelines typically specify the desired configuration as a target 6-DOF pose and rely on explicitly estimating the pose of the manipulated objects. However, representing an object with a parameterized transformation defined on a fixed template cannot capture large intra-category shape variation, and specifying a target pose at a category level can be physically infeasible or fail to accomplish the task. For example, knowing the pose and size of a coffee mug relative to some canonical mug is not sufficient to successfully hang it on a rack by its handle. Hence we propose a novel formulation of category-level manipulation that uses semantic 3D keypoints as the object representation. This keypoint representation enables a simple and interpretable specification of the manipulation target as geometric costs and constraints on the keypoints, which flexibly generalizes existing pose-based manipulation methods. Using this formulation, we factor the manipulation policy into instance segmentation, 3D keypoint detection, optimization-based robot action planning, and local dense-geometry-based action execution. This factorization allows us to leverage advances in these sub-problems and combine them into a general and effective perception-to-action manipulation pipeline. Our pipeline is robust to large intra-category shape variation and topology changes because the keypoint representation ignores task-irrelevant geometric details. Extensive hardware experiments demonstrate that our method can reliably accomplish tasks with never-before-seen objects in a category, such as placing shoes and mugs with significant shape variation into category-level target configurations.

Paper Structure

This paper contains 29 sections, 3 equations, 15 figures.

Figures (15)

  • Figure 1: kPAM is a framework for defining and accomplishing category-level manipulation tasks. The key distinction of kPAM is the use of semantic 3D keypoints as the object representation (a), which enables flexible specification of manipulation targets as geometric costs/constraints on keypoints. Using this framework we can handle wide intra-class shape variation (a) and reliably accomplish category-level manipulation tasks such as perceiving (b), grasping (c), and placing (d) any mug on a rack by its handle. A video demo for this task is available at https://sites.google.com/view/kpam.
  • Figure 2: An overview of our manipulation formulation using the "put mugs upright on the table" task as an example: (a) we train a category level keypoint detector that produces two keypoints: $p_\text{bottom\_center}$ and $p_\text{top\_center}$. The axis of the mug $v_\text{mug\_axis}$ is a unit vector from $p_\text{bottom\_center}$ to $p_\text{top\_center}$. (b) Given an observed mug, its two keypoints on bottom center and top center are detected. The rigid transform $T_\text{action}$, which represents the robotic pick-and-place action, is solved to move the bottom center of the mug to the target location $p_\text{target}$ and align the mug axis with the target direction $v_\text{target\_axis}$.
  • Figure 3: An overview of the category-level pick-and-place pipeline using our manipulation formulation. Given an RGBD image with instance segmentation, the semantic 3D keypoints of the object in question are detected. We then feed these 3D keypoints into an optimization-based planning algorithm to compute the robot pick-and-place action, which is represented by a rigid transformation $T_\text{action}$. Finally, we use an object-agnostic grasp planner to pick up the object and apply the computed robot action.
  • Figure 4: A pose representation cannot capture large intra-category variations. Here we show different alignment results from a shoe template (blue) to a boot observation (red). (a) and (b) are produced by FilterReg (gao2019filterreg) with different random seeds, and the estimated transformation consists of a rigid pose and a global scale. In (c), the estimated transformation is a fully non-rigid deformation field, as in coherent point drift (myronenko2010cpd). In these examples, the shoe template and transformations cannot capture the geometry of the boot observation. Additionally, there may exist multiple suboptimal alignments, which makes the pose estimate ambiguous. The robotic pick-and-place actions that follow from these estimates differ, even though each alignment is geometrically reasonable.
  • Figure 5: A comparison of keypoint-based manipulation with pose-based manipulation for two different tasks involving mugs. The first row considers the mug-on-rack task, where a mug must be hung on a rack by its handle. (a) shows a reference mug in the goal state; (b) and (c) show a scaled-down mug instance that could be encountered at test time. (b) uses keypoint-based optimization with a constraint on the handle keypoint to find the target state for the mug. The optimized goal state successfully achieves the task of hanging the mug on the rack. In contrast, (c) shows the scaled mug instance at the pose defined by (a), which leads to the handle of the mug completely missing the rack, a failure of the task. The second row shows the task of putting a mug on a table. Again, (a) shows a reference mug in a goal state; (b) and (c) show a scaled-up mug that could be encountered at test time. (b) uses keypoint-based optimization with costs/constraints on the bottom and top keypoints to place the mug in a valid goal state. (c) directly uses the pose from (a) on the new mug instance, which leads to an invalid goal state where the mug is penetrating the table.
  • ...and 10 more figures
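
The "put mugs upright on the table" formulation described in the Figure 2 caption can be illustrated with a small sketch. The paper solves for $T_\text{action}$ with an optimization over keypoint costs and constraints; the code below is not that solver. It is a closed-form stand-in, assuming the only requirements are that $p_\text{bottom\_center}$ lands exactly on $p_\text{target}$ and that $v_\text{mug\_axis}$ aligns with $v_\text{target\_axis}$. The function names (`rotation_aligning`, `solve_upright_action`, `apply_action`) and the numeric keypoints are hypothetical, chosen only for illustration.

```python
import numpy as np


def rotation_aligning(a, b):
    """Smallest rotation R such that R @ a is parallel to b (Rodrigues formula)."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    v = np.cross(a, b)          # rotation axis scaled by sin(angle)
    c = float(np.dot(a, b))     # cos(angle)
    s = np.linalg.norm(v)       # sin(angle)
    if s < 1e-9:
        if c > 0:
            return np.eye(3)    # already aligned
        # Antiparallel case: rotate by pi about any axis perpendicular to a.
        helper = np.array([1.0, 0.0, 0.0]) if abs(a[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
        axis = np.cross(a, helper)
        axis /= np.linalg.norm(axis)
        return 2.0 * np.outer(axis, axis) - np.eye(3)
    K = np.array([[0.0, -v[2], v[1]],
                  [v[2], 0.0, -v[0]],
                  [-v[1], v[0], 0.0]])
    return np.eye(3) + K + (K @ K) * ((1.0 - c) / s**2)


def solve_upright_action(p_bottom, p_top, p_target, v_target_axis):
    """Return a 4x4 rigid transform (hypothetical stand-in for T_action)."""
    v_mug_axis = (p_top - p_bottom) / np.linalg.norm(p_top - p_bottom)
    R = rotation_aligning(v_mug_axis, v_target_axis)
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = p_target - R @ p_bottom  # ensures T maps p_bottom onto p_target
    return T


def apply_action(T, p):
    """Apply a 4x4 rigid transform to a 3D point."""
    return T[:3, :3] @ p + T[:3, 3]


# Illustrative keypoints (metres) for a mug lying on its side along the x axis.
p_bottom = np.array([0.5, 0.1, 0.05])
p_top = np.array([0.6, 0.1, 0.05])
p_target = np.array([0.7, -0.2, 0.0])
v_target_axis = np.array([0.0, 0.0, 1.0])

T_action = solve_upright_action(p_bottom, p_top, p_target, v_target_axis)
new_bottom = apply_action(T_action, p_bottom)
new_top = apply_action(T_action, p_top)
```

Because the target is written in terms of keypoints rather than a template pose, the same call works unchanged for a taller or shorter mug: only the detected `p_top` changes, and the bottom keypoint still lands on `p_target`. The rotation about the mug axis is left unconstrained here and resolved by taking the smallest aligning rotation; an optimization-based planner could instead resolve it through additional costs or constraints.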