Learning Dual-Arm Push and Grasp Synergy in Dense Clutter

Yongliang Wang, Hamidreza Kasaei

TL;DR

This work tackles dense-clutter robotic grasping by proposing a target-driven, dual-arm push-grasp framework trained with a CNN-based PPO policy. It combines a large-scale backbone with an Angle-View Net to output 6-DoF grasp candidates and flexible push trajectories, guided by a novel fuzzy reward that accelerates learning. The method treats push and grasp as a unified action set within a hierarchical, target-conditioned MDP, and demonstrates strong sim-to-real transfer without additional fine-tuning. Results show improved task completion, grasp success, and action efficiency over baselines in both simulation and real-world experiments, highlighting practical potential for dense clutter manipulation with dual arms.

Abstract

Robotic grasping in densely cluttered environments is challenging due to scarce collision-free grasp affordances. Non-prehensile actions can increase feasible grasps in cluttered environments, but most research focuses on single-arm rather than dual-arm manipulation. Policies from single-arm systems fail to fully leverage the advantages of dual-arm coordination. We propose a target-oriented hierarchical deep reinforcement learning (DRL) framework that learns dual-arm push-grasp synergy for grasping objects to enhance dexterous manipulation in dense clutter. Our framework maps visual observations to actions via a pre-trained deep learning backbone and a novel CNN-based DRL model, trained with Proximal Policy Optimization (PPO), to develop a dual-arm push-grasp strategy. The backbone enhances feature mapping in densely cluttered environments. A novel fuzzy-based reward function is introduced to accelerate efficient strategy learning. Our system is developed and trained in Isaac Gym and then tested in simulations and on a real robot. Experimental results show that our framework effectively maps visual data to dual push-grasp motions, enabling the dual-arm system to grasp target objects in complex environments. Compared to other methods, our approach generates 6-DoF grasp candidates and enables dual-arm push actions, mimicking human behavior. Results show that our method efficiently completes tasks in densely cluttered environments. https://sites.google.com/view/pg4da/home

Paper Structure

This paper contains 26 sections, 3 equations, 11 figures, 3 tables.

Figures (11)

  • Figure 1: An example of a target-oriented grasping task where the robot aims to grasp a green object in dense clutter. Tight packing around the target requires synergy between pushing and grasping actions. Our system maps RGB-D images to actions, decides between Push and Grasp based on the current state, and plans suitable push actions (dual or single paths) to isolate the object before executing a stable 6-DoF grasp to complete the task.
  • Figure 2: Dual-Arm Push-Grasping Learning Framework Overview: In the Isaac Gym environment, the target object is highlighted in green, and an RGB-D camera integrated with a dual-arm UR5e robot captures images, which are converted into top-down height maps. RGB images are fed into a grasp network pre-trained on GraspNet-1Billion, which extracts features that are passed to a CNN-based RL model trained with PPO. This model produces a feature map, which two motion decoders decode into actions executed in the environment. A fuzzy reward module provides feedback and guides training in the light blue area.
  • Figure 3: From Angle View to Final Orientation: Gripper orientation for grasping is determined by a view vector and an in-plane rotation angle. In the left section of the figure, $V$ view vectors are uniformly sampled across the upper hemisphere, while in the middle, $A$ in-plane rotation angles are sampled. Here, $V$ and $A$ are set to $60$ and $6$, respectively, giving $V \times A = 360$ view-angle combinations. The model outputs an index in the range $0$–$359$, which is then decoded to determine the final gripper orientation.
  • Figure 4: Structure of the Angle-View Net (AVN): The network takes an RGB image as input, extracts dense features with a ResNet, and upsamples them using pixel shuffle and DUC layers to produce the AVH.
  • Figure 5: Decoders for Dual-Arm Execution: The decoders translate the feature map into actions (shown as a sequence of timesteps from one trial). A target mask produces the grasp prediction, while a target mask expanded by a factor of 1.5 produces the push score ①. The action with the higher score, either grasp or push, is selected. For a grasp, the target-masked feature map is used; for a push, the mask is adjusted based on the largest contour radius ②. Grasping extracts translation and orientation; pushing connects key points into paths ③. The robot executes the action after motion planning and inverse kinematics ④.
  • ...and 6 more figures
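
The orientation decoding described in the Figure 3 caption can be illustrated with a minimal sketch. Only $V = 60$, $A = 6$, and the index range 0–359 come from the caption. The Fibonacci-lattice sampling, the index layout (`view_id * A + angle_id`), the angle range $[0, \pi)$, and the function names `sample_views` and `decode_orientation` are all assumptions made for illustration and may differ from the paper's implementation.

```python
import math

V = 60  # number of view vectors (from the Figure 3 caption)
A = 6   # number of in-plane rotation angles (from the Figure 3 caption)


def sample_views(num_views):
    """Return roughly uniform unit vectors on the upper hemisphere (z > 0).

    Assumption: a Fibonacci lattice is used; the caption only says the
    views are "uniformly sampled".
    """
    golden = (1 + math.sqrt(5)) / 2
    views = []
    for i in range(num_views):
        z = 1 - (i + 0.5) / num_views      # z stays in (0, 1): upper hemisphere
        r = math.sqrt(1 - z * z)           # radius of the circle at height z
        phi = 2 * math.pi * i / golden     # golden-ratio azimuth spacing
        views.append((r * math.cos(phi), r * math.sin(phi), z))
    return views


def decode_orientation(index, views, num_angles=A):
    """Split a flat index into (view vector, in-plane angle in radians).

    Assumptions: index = view_id * num_angles + angle_id, and the angles
    are evenly spaced over [0, pi).
    """
    if not 0 <= index < len(views) * num_angles:
        raise ValueError("index out of range")
    view_id, angle_id = divmod(index, num_angles)
    angle = angle_id * math.pi / num_angles
    return views[view_id], angle
```

For example, `decode_orientation(359, sample_views(V))` selects the last view vector together with the last in-plane angle under these assumed conventions.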
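
Step ① of the Figure 5 caption (score the grasp under the target mask, score the push under a mask expanded 1.5 times, pick the higher) can be sketched roughly as follows. This is a simplified illustration, not the paper's decoder: it assumes a circular target mask, two separate hypothetical score maps (`grasp_map`, `push_map`), a per-mask maximum as the score, and a tie broken in favor of grasping.

```python
def disc_mask(height, width, center, radius):
    """Boolean mask of pixels within `radius` of `center` (row, col)."""
    cy, cx = center
    return [
        [(y - cy) ** 2 + (x - cx) ** 2 <= radius ** 2 for x in range(width)]
        for y in range(height)
    ]


def masked_max(score_map, mask):
    """Best score and its (row, col) among pixels where the mask is True."""
    best, best_pos = float("-inf"), None
    for y, row in enumerate(score_map):
        for x, value in enumerate(row):
            if mask[y][x] and value > best:
                best, best_pos = value, (y, x)
    return best, best_pos


def select_action(grasp_map, push_map, center, radius, expand=1.5):
    """Choose "grasp" or "push" by comparing masked scores.

    The grasp score is read inside the target mask, the push score inside
    the mask expanded by `expand` (1.5 in the caption). Ties go to grasp,
    which is an assumption.
    """
    h, w = len(grasp_map), len(grasp_map[0])
    grasp_score, grasp_px = masked_max(grasp_map, disc_mask(h, w, center, radius))
    push_score, push_px = masked_max(push_map, disc_mask(h, w, center, expand * radius))
    if grasp_score >= push_score:
        return "grasp", grasp_px, grasp_score
    return "push", push_px, push_score
```

The expanded mask lets push candidates start just outside the target's footprint, which matches the caption's idea of pushing to isolate the object before grasping.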