MPGNet: Learning Move-Push-Grasping Synergy for Target-Oriented Grasping in Occluded Scenes

Dayou Li, Chenkun Zhao, Shuo Yang, Ran Song, Xiaolei Li, Wei Zhang

TL;DR

MPGNet tackles target-oriented grasping in occluded scenes with a three-branch network that learns a synergy between moving, pushing, and grasping actions. A multi-stage training strategy stabilizes learning and coordinates the three branches, and the resulting policy outperforms the compared baselines in both simulated and real-world tests. The work reports fast convergence, high grasp success, and few actions per grasp, and it transfers from simulation to the real world without fine-tuning. It also points to human guidance or multimodal input as ways to further improve grasping of occluded objects in practical settings.

Abstract

This paper focuses on target-oriented grasping in occluded scenes, where the target object is specified by a binary mask and the goal is to grasp it with as few robotic manipulations as possible. Most existing methods rely on a push-grasping synergy to complete this task. To deliver a more powerful target-oriented grasping pipeline, we present MPGNet, a three-branch network for learning a synergy between moving, pushing, and grasping actions. We also propose a multi-stage training strategy to train MPGNet, which contains three policy networks corresponding to the three actions. The effectiveness of our method is demonstrated via both simulated and real-world experiments.

Paper Structure

This paper contains 22 sections, 8 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: Through the synergy of moving, pushing, and grasping actions, the robotic arm based on the proposed MPGNet can efficiently grasp the target object in occluded scenes.
  • Figure 2: Overview of MPGNet. The input data of MPGNet are obtained by an RGB-D camera from a top-down view. The heightmaps are rotated at 16 angles to predict different motion orientations and fed into MPGNet. Move-net, grasp-net, and push-net work collaboratively to grasp target objects in occluded scenes. The move-net is designed to remove occluding objects to make the target as graspable as possible. When there is no occlusion in the workspace, the pushing action assists in grasping the target object.
  • Figure 3: We train MPGNet in the simulated environment and then transfer it to the real world.
  • Figure 4: Learning curves of different methods.
  • Figure 5: Visualization of the Q-maps corresponding to the three primitive actions produced by MPGNet.
  • ...and 3 more figures
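
The Figure 2 and Figure 5 captions describe three branches that each produce Q-maps over 16 rotated heightmaps. As a rough illustration of how such Q-maps could be decoded into a single primitive action, the sketch below picks the highest-valued entry across all branches. Several things here are assumptions and not taken from the paper: the `[rotation][row][col]` layout, the even spacing of the 16 angles over a full turn, the greedy rule for choosing between branches (the paper's own coordination between move-net, push-net, and grasp-net may differ), and the helper names `zero_qmap`, `best_in_qmap`, and `select_action`.

```python
from typing import Dict, List, Tuple

NUM_ROTATIONS = 16                       # number of angles, from the Figure 2 caption
ANGLE_STEP_DEG = 360.0 / NUM_ROTATIONS   # assumption: angles evenly spaced over 360 degrees

QMap = List[List[List[float]]]           # assumed layout: [rotation][row][col]


def zero_qmap(height: int, width: int) -> QMap:
    """Build an all-zero Q-map with one plane per rotation (hypothetical helper)."""
    return [[[0.0] * width for _ in range(height)] for _ in range(NUM_ROTATIONS)]


def best_in_qmap(qmap: QMap) -> Tuple[float, int, int, int]:
    """Return (q_value, rotation_index, row, col) of the largest entry."""
    best = (float("-inf"), 0, 0, 0)
    for k, plane in enumerate(qmap):
        for r, row in enumerate(plane):
            for c, q in enumerate(row):
                if q > best[0]:
                    best = (q, k, r, c)
    return best


def select_action(qmaps: Dict[str, QMap]) -> Dict[str, object]:
    """Greedy choice over all primitives.

    Assumption: the branch with the single highest Q-value wins. This is a
    simplification for illustration, not the paper's coordination rule.
    """
    name, (q, k, r, c) = max(
        ((n, best_in_qmap(m)) for n, m in qmaps.items()),
        key=lambda item: item[1][0],
    )
    return {
        "primitive": name,
        "angle_deg": k * ANGLE_STEP_DEG,
        "pixel": (r, c),
        "q": q,
    }


# Toy Q-maps standing in for the outputs of the three branches.
qmaps = {name: zero_qmap(4, 4) for name in ("move", "push", "grasp")}
qmaps["move"][0][2][2] = 0.7
qmaps["push"][5][0][0] = 0.4
qmaps["grasp"][3][1][2] = 0.9

action = select_action(qmaps)
print(action)
```

In this toy input the grasp branch holds the largest value, so the decoded action is a grasp at rotation index 3 (67.5 degrees under the even-spacing assumption) and pixel (1, 2).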