HACMan++: Spatially-Grounded Motion Primitives for Manipulation

Bowen Jiang, Yilin Wu, Wenxuan Zhou, Chris Paxton, David Held

TL;DR

HACMan++ addresses generalization gaps in robotic manipulation by introducing spatially grounded, parameterized motion primitives within a discrete-continuous action space. The method learns to chain diverse primitives, each grounded at a per-point location with task-specific parameters, using a hybrid TD3-style actor-critic framework that produces per-point Q-values (Critic Maps) to select actions. Across ManiSkill, Robosuite, and a challenging DoubleBin task, HACMan++ outperforms non-grounded baselines and generalizes to unseen objects; zero-shot sim-to-real transfer achieves 73% success in real-world tests. By coupling temporal abstraction with precise spatial reasoning, the approach supports long-horizon manipulation across varied object geometries.

Abstract

Although end-to-end robot learning has shown some success for robot manipulation, the learned policies are often not sufficiently robust to variations in object pose or geometry. To improve the policy generalization, we introduce spatially-grounded parameterized motion primitives in our method HACMan++. Specifically, we propose an action representation consisting of three components: what primitive type (such as grasp or push) to execute, where the primitive will be grounded (e.g. where the gripper will make contact with the world), and how the primitive motion is executed, such as parameters specifying the push direction or grasp orientation. These three components define a novel discrete-continuous action space for reinforcement learning. Our framework enables robot agents to learn to chain diverse motion primitives together and select appropriate primitive parameters to complete long-horizon manipulation tasks. By grounding the primitives on a spatial location in the environment, our method is able to effectively generalize across object shape and pose variations. Our approach significantly outperforms existing methods, particularly in complex scenarios demanding both high-level sequential reasoning and object generalization. With zero-shot sim-to-real transfer, our policy succeeds in challenging real-world manipulation tasks, with generalization to unseen objects. Videos can be found on the project website: https://sgmp-rss2024.github.io.
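
The three-component action described in the abstract (what, where, how) can be illustrated with a small data structure. This is a minimal sketch, not the authors' implementation: the names `PrimitiveType`, `GroundedAction`, and `grounding_location` are hypothetical, the two primitive types are only the examples the abstract mentions, and the 3-dimensional motion parameter vector is an assumption chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum

import numpy as np


class PrimitiveType(Enum):
    """'What': the discrete primitive type (hypothetical set; the abstract cites grasp and push)."""
    GRASP = 0
    PUSH = 1


@dataclass
class GroundedAction:
    """One discrete-continuous action: what primitive, where it is grounded, how it moves."""
    primitive: PrimitiveType    # what: discrete choice of primitive
    point_index: int            # where: index of a point in the observed point cloud
    motion_params: np.ndarray   # how: continuous parameters (e.g. a push direction)


def grounding_location(points: np.ndarray, action: GroundedAction) -> np.ndarray:
    """Return the 3D location at which the primitive is grounded.

    points: array of shape (N, 3) holding the observed point cloud.
    """
    return points[action.point_index]


# Usage: ground a push at the second point of a toy point cloud.
points = np.array([[0.0, 0.0, 0.0], [0.1, 0.2, 0.3], [0.5, 0.5, 0.5]])
action = GroundedAction(
    primitive=PrimitiveType.PUSH,
    point_index=1,
    motion_params=np.array([1.0, 0.0, 0.0]),  # assumed 3D push direction
)
location = grounding_location(points, action)
```

Because the "where" component is an index into the observation rather than a free coordinate, the action is tied to geometry the robot actually sees, which is the property the abstract credits for generalization across object shape and pose.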

Paper Structure (36 sections, 5 equations, 16 figures, 4 tables)

Figures (16)

  • Figure 2: Our method processes a point cloud to estimate a set of per-point primitive parameters $a_i^m$ for each point $x_i$ in the point cloud and for each primitive in our primitive set. We then compute a set of "Critic Maps" (one per primitive) which estimate the Q-value $Q_{i,k}$ of using each primitive $k$, grounded at each point $x_i$, and parameterized by the estimated primitive parameters $a_i^m$. We either sample from the Critic Map (during training) or choose the point and primitive with the highest score (during evaluation) for robot execution.
  • Figure 3: We evaluate our method on multiple object manipulation tasks that require picking, placing, and poking objects. From left to right, we show the six simulation tasks: ManiSkill2 Lift Cube, ManiSkill2 Stack Cube, ManiSkill2 Peg Insertion, Robosuite Pick-and-Place, Robosuite Door Opening, and a customized Robosuite DoubleBin environment. We also show our real-world experiment setup which mimics the DoubleBin simulation environment.
  • Figure 4: Performance of our method compared to the baselines RAPS (Dalal et al., 2021) and P-DQN (Xiong et al., 2018) on six different tasks. For all the ManiSkill tasks and Robosuite tasks, we report the success rate averaged over 20 trials. For DoubleBin tasks, we report the average success rate over 32 objects, each tested with 70 trials. These baselines use the same skill primitives as our approach but are not spatially grounded, i.e. they do not ground the primitives on a point selected by the policy from the observed point cloud.
  • Figure 5: A simulation rollout of our policy. The goal is shown in the top right and is also overlaid on each observation. At each step, we visualize the scores that we assign to each of the primitives. We also visualize the selected primitive location and parameters (orange arrow). As shown, our method learns to chain a sequence of grounded primitives to accomplish a challenging long-horizon manipulation task.
  • Figure 6: Success rate as a function of the episode length, for training objects (all), training objects from categories with many instances (common categories), unseen instances from those same common categories, and for unseen object classes. We train with an episode length of 10 but evaluate with varying episode lengths up to 30.
  • ...and 11 more figures
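
The action selection described in the Figure 2 caption (sample from the Critic Map during training, take the highest-scoring point and primitive during evaluation) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's code: the function name `select_action`, the array shapes, and the use of a softmax with a `temperature` parameter as the sampling distribution are all assumptions, since the caption only says that the Critic Map is sampled.

```python
import numpy as np


def select_action(q_values, motion_params, training, temperature=1.0, rng=None):
    """Pick a (point, primitive) pair from per-point, per-primitive Q-values.

    q_values:      shape (N, K), Q-value of primitive k grounded at point i.
    motion_params: shape (N, K, D), estimated parameters for each pair.
    training:      if True, sample from a softmax over all Q-values (assumed
                   sampling scheme); if False, take the argmax.
    """
    flat_q = q_values.reshape(-1)
    if training:
        rng = np.random.default_rng() if rng is None else rng
        # Subtract the max before exponentiating for numerical stability.
        logits = (flat_q - flat_q.max()) / temperature
        probs = np.exp(logits)
        probs /= probs.sum()
        flat_idx = rng.choice(flat_q.size, p=probs)
    else:
        flat_idx = np.argmax(flat_q)
    # Recover the point index and primitive index from the flattened index.
    point_idx, prim_idx = np.unravel_index(flat_idx, q_values.shape)
    return int(point_idx), int(prim_idx), motion_params[point_idx, prim_idx]


# Usage: 4 points, 2 primitives, 3 motion parameters per pair (toy values).
q = np.zeros((4, 2))
q[2, 1] = 5.0
params = np.arange(4 * 2 * 3, dtype=float).reshape(4, 2, 3)

eval_choice = select_action(q, params, training=False)
train_choice = select_action(q, params, training=True, rng=np.random.default_rng(0))
```

Flattening the (N, K) map before selection treats the choice of point and the choice of primitive as a single joint decision, so one argmax or one sample yields both.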