Learning Extrinsic Dexterity with Parameterized Manipulation Primitives

Shih-Min Yang, Martin Magnusson, Johannes A. Stork, Todor Stoyanov

TL;DR

This work tackles occluded grasping by introducing ED-PMP, a hierarchical reinforcement learning framework that sequences parameterized manipulation primitives and learns a low-level controller for a contact-rich flip primitive. The high-level policy uses depth perception to select among push, flip, and grasp primitives, while the low-level policy learns effective flip actions, enabling extrinsic dexterity through interactions with the environment. A curriculum learning strategy paired with automatic domain randomization enables zero-shot transfer from simulation to a real robot, achieving up to 98% success in real-world box picking across varied objects and configurations. The approach reduces the need for manually designed primitives and object pose estimators, offering a scalable pathway to extrinsic dexterity in cluttered or occluded settings with simple grippers.

Abstract

Many practically relevant robot grasping problems feature a target object for which all grasps are occluded, e.g., by the environment. Single-shot grasp planning invariably fails in such scenarios. Instead, it is necessary to first manipulate the object into a configuration that affords a grasp. We solve this problem by learning a sequence of actions that utilize the environment to change the object's pose. Concretely, we employ hierarchical reinforcement learning to combine a sequence of learned parameterized manipulation primitives. By learning the low-level manipulation policies, our approach can control the object's state by exploiting interactions between the object, the gripper, and the environment. Designing such a complex behavior analytically would be infeasible under uncontrolled conditions, as an analytic approach requires accurate physical modeling of the interaction and contact dynamics. In contrast, we learn a hierarchical policy model that operates directly on depth perception data, without the need for object detection, pose estimation, or manual design of controllers. We evaluate our approach on picking box-shaped objects of various weight, shape, and friction properties from a constrained table-top workspace. Our method transfers to a real robot and successfully completes the object picking task in 98% of experimental trials. Supplementary information and videos can be found at https://shihminyang.github.io/ED-PMP/.

Paper Structure (13 sections, 2 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: Top: In the initial pose, all feasible grasps on the target object are occluded by the environment. Bottom (left to right): We learn to push the object to a wall and exploit it as a pivot to flip the object up and finally grasp it from the top.
  • Figure 2: Overview. Our ED-PMP method breaks down complex tasks into sub-tasks and reduces the need for manual primitive design. It comprises a high-level and a low-level agent. High-Level Agent (top): The high-level agent takes a height map as input to a DQN, implemented using an FCN model. It then outputs pixel-wise maps of Q values, where each pixel corresponds to a starting pose and a primitive. Low-Level Agent (bottom): The low-level agent uses the current end-effector pose and contact force as the state of a DQN model. It iteratively estimates a series of actions to accomplish the sub-task within a designated number of iterations, denoted as $T$.
  • Figure 3: The end-effector displacement $(d, z, \theta_y)$ corresponds to the action space of the low-level agent, where $P_s$ is the starting pose and $p_{a^l}$ is the current end-effector task-space pose.
  • Figure 4: Testing curve of success rate versus training episodes of the high-level model in simulation. (a) The completion rate for full-task success (successfully picking the object with 10 or fewer primitives). (b) The success rate for the grasp primitive. (c) The success rate for the flip primitive. Success rates of the primitives are computed over the last 100 attempts.
  • Figure 5: An example sequence of picking up a flat object (in simulation). From top to bottom, each row represents a sequence of decisions made by the high-level agent. The left column shows the current observation in the form of a height map, while the right column contains the estimated Q maps for each of the three primitives across 16 orientations. The maximum Q value and corresponding height map pixel in each step are marked with red arrows.
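
The captions of Figures 2 and 5 describe the high-level agent as choosing the pixel, orientation, and primitive with the maximum Q value. The following is a minimal sketch of that selection step only, not the authors' implementation. The array layout `(3, 16, H, W)`, the ordering of primitives in `PRIMITIVES`, and the function name `select_action` are all assumptions made for illustration.

```python
import numpy as np

# Assumed ordering of the three primitives named in the text (hypothetical).
PRIMITIVES = ("push", "flip", "grasp")
# Number of orientations per primitive, as stated in the Figure 5 caption.
NUM_ORIENTATIONS = 16


def select_action(q_maps):
    """Pick the entry with the largest Q value from pixel-wise Q maps.

    q_maps: array of shape (3, 16, H, W), i.e. one H x W map per
    primitive and orientation. This layout is an assumption.
    Returns (primitive name, orientation index, row, col).
    """
    q_maps = np.asarray(q_maps)
    expected = (len(PRIMITIVES), NUM_ORIENTATIONS)
    if q_maps.ndim != 4 or q_maps.shape[:2] != expected:
        raise ValueError("expected Q maps of shape (3, 16, H, W)")
    # argmax over the flattened array, then recover the 4-D index.
    flat_index = int(np.argmax(q_maps))
    p, o, row, col = np.unravel_index(flat_index, q_maps.shape)
    return PRIMITIVES[int(p)], int(o), int(row), int(col)


if __name__ == "__main__":
    # Random stand-in for network output on a 64 x 64 height map.
    rng = np.random.default_rng(0)
    print(select_action(rng.normal(size=(3, 16, 64, 64))))
```

The selected pixel gives the starting position on the height map, and the orientation index gives the starting rotation. How an index maps to a physical angle is not specified in the text above, so it is left out of the sketch.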