Push-Grasp Policy Learning Using Equivariant Models and Grasp Score Optimization

Boce Hu, Heng Tian, Dian Wang, Haojie Huang, Xupeng Zhu, Robin Walters, Robert Platt

TL;DR

This paper proposes the Equivariant Push-Grasp Network, a framework for joint pushing and grasping policy learning that improves grasp success rates by 49% in simulation and by 35% in real-world scenarios compared to strong baselines.

Abstract

Goal-conditioned robotic grasping in cluttered environments remains a challenging problem due to occlusions caused by surrounding objects, which prevent direct access to the target object. A promising solution to mitigate this issue is combining pushing and grasping policies, enabling active rearrangement of the scene to facilitate target retrieval. However, existing methods often overlook the rich geometric structures inherent in such tasks, thus limiting their effectiveness in complex, heavily cluttered scenarios. To address this, we propose the Equivariant Push-Grasp Network, a novel framework for joint pushing and grasping policy learning. Our contributions are twofold: (1) leveraging SE(2)-equivariance to improve both pushing and grasping performance and (2) a grasp score optimization-based training strategy that simplifies the joint learning process. Experimental results show that our method improves grasp success rates by 49% in simulation and by 35% in real-world scenarios compared to strong baselines, representing a significant advancement in push-grasp policy learning.
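To make the equivariance property concrete: a map f is equivariant to a transformation g when f(g(x)) = g(f(x)), so rotating or shifting the scene rotates or shifts the output in the same way. The sketch below is not the paper's network. It is a minimal NumPy illustration that assumes a toy 8x8 array as a stand-in for a top-down depth image, 90-degree rotations and cyclic shifts as a discrete stand-in for SE(2), and two hypothetical hand-written filters (`box_filter`, `horizontal_gradient`) in place of learned layers.

```python
import numpy as np


def box_filter(x: np.ndarray) -> np.ndarray:
    """3x3 mean filter with wrap-around borders (a rotation-symmetric kernel)."""
    out = np.zeros_like(x, dtype=float)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            out += np.roll(x, shift=(dy, dx), axis=(0, 1))
    return out / 9.0


def horizontal_gradient(x: np.ndarray) -> np.ndarray:
    """Difference with the left neighbour (a kernel with a preferred direction)."""
    return x - np.roll(x, shift=1, axis=1)


def rotate(img: np.ndarray) -> np.ndarray:
    """Rotate the image by 90 degrees."""
    return np.rot90(img)


def translate(img: np.ndarray) -> np.ndarray:
    """Cyclically shift the image by 2 rows and 3 columns."""
    return np.roll(img, shift=(2, 3), axis=(0, 1))


def is_equivariant(f, x: np.ndarray, g) -> bool:
    """Check numerically whether f(g(x)) equals g(f(x))."""
    return bool(np.allclose(f(g(x)), g(f(x))))


rng = np.random.default_rng(0)
scene = rng.random((8, 8))  # toy stand-in for a top-down depth image

box_rot = is_equivariant(box_filter, scene, rotate)
box_shift = is_equivariant(box_filter, scene, translate)
grad_rot = is_equivariant(horizontal_gradient, scene, rotate)
grad_shift = is_equivariant(horizontal_gradient, scene, translate)
```

In this toy setting the symmetric filter commutes with both rotations and shifts, while the directional filter commutes only with shifts. An equivariant architecture builds the rotational symmetry into every layer instead of relying on the kernel happening to be symmetric.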

Paper Structure

This paper contains 17 sections, 3 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Illustration of the Push-Grasp Workflow. The target object, specified by human instruction, is highlighted with a red mask (e.g., a banana). At each step, the push action direction is represented by an arrow. Our method iteratively predicts and executes push actions to create sufficient space for grasping the target. The final grasp pose is shown as a blue rectangle, with green blocks indicating the gripper's fingers.
  • Figure 2: Given an RGB-D observation, SAM2 (Ravi et al., 2024) generates a set of object masks. GraspNet and PushNet then use the depth image and these masks to predict candidate grasp and push actions. The target object's grasp pose is filtered using its corresponding mask, and the best candidate is selected. Finally, CriticNet evaluates the selected grasp pose against a threshold $\tau$ to determine whether to execute the grasp or a push action.
  • Figure 3: PushNet Training and CriticNet Finetuning Pipeline. The push reward is derived from the Grasp Imagination Module: it is 1 if the imagined grasp succeeds, otherwise it equals the difference in predicted grasp scores before and after the push.
  • Figure 4: Illustration of how an element $g$ acts on feature maps by rotating the pixels and permuting the order of the channels. The angles above feature maps indicate the candidate grasp orientations.
  • Figure 5: PushNet Structure. In the graph, the target node (red star) connects to nearby nodes within a predefined distance threshold (blue circle). Green edges are valid connections, while red edges are invalid.
  • ...and 3 more figures
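
The decision rule in the Figure 2 caption and the push reward in the Figure 3 caption can be sketched in a few lines. This is an illustrative reading of the captions, not the authors' code. The names `GraspImagination`, `push_reward`, and `choose_action` are hypothetical, and two details are assumptions: the score difference is taken as after minus before, and a grasp is executed when the critic score is at least $\tau$.

```python
from dataclasses import dataclass


@dataclass
class GraspImagination:
    """Hypothetical container for outputs of the Grasp Imagination Module."""
    imagined_success: bool  # whether the imagined grasp succeeds after the push
    score_before: float     # predicted grasp score before the push
    score_after: float      # predicted grasp score after the push


def push_reward(result: GraspImagination) -> float:
    """Push reward as described in the Figure 3 caption."""
    if result.imagined_success:
        return 1.0
    # Assumption: the difference is after minus before, so a push that
    # raises the predicted grasp score earns a positive reward.
    return result.score_after - result.score_before


def choose_action(critic_score: float, tau: float) -> str:
    """Grasp-or-push decision as described in the Figure 2 caption."""
    # Assumption: grasp when the critic score reaches the threshold tau.
    return "grasp" if critic_score >= tau else "push"


helpful_push = push_reward(GraspImagination(False, 0.25, 0.5))
harmful_push = push_reward(GraspImagination(False, 0.5, 0.25))
decisive_push = push_reward(GraspImagination(True, 0.25, 0.5))
```

Under these assumptions, a push that makes the imagined grasp succeed earns the full reward of 1, and every other push is rewarded in proportion to how much it changes the predicted grasp score.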