Learning Dual-Arm Coordination for Grasping Large Flat Objects

Yongliang Wang, Hamidreza Kasaei

TL;DR

This work tackles the challenge of grasping large flat objects that are difficult for a single arm by learning coordinated dual-arm grasp strategies. It fuses a large-scale grasp pose detector (AVN) as a visual backbone with a CNN-based PPO policy to produce dual-arm grasp points, trained entirely in simulation and deployed to real UR5e robots without fine-tuning. The approach demonstrates strong generalization to unseen object shapes, including beveled and irregular forms, and achieves high success rates in both simulated and real settings, outperforming push-to-edge baselines. The results suggest practical potential for robust dual-arm manipulation in cluttered and feature-poor environments, with future work focusing on pre-grasp maneuvers and tactile sensing to handle challenging off-center and ultra-thin objects.

Abstract

Grasping large flat objects, such as books or keyboards lying horizontally, presents significant challenges for single-arm robotic systems, often requiring extra actions like pushing objects against walls or moving them to the edge of a surface to facilitate grasping. In contrast, dual-arm manipulation, inspired by human dexterity, offers a more refined solution by directly coordinating both arms to lift and grasp the object without the need for complex repositioning. In this paper, we propose a model-free deep reinforcement learning (DRL) framework to enable dual-arm coordination for grasping large flat objects. We utilize a large-scale grasp pose detection model as a backbone to extract high-dimensional features from input images, which are then used as the state representation in a reinforcement learning (RL) model. A CNN-based Proximal Policy Optimization (PPO) algorithm with shared Actor-Critic layers is employed to learn coordinated dual-arm grasp actions. The system is trained and tested in Isaac Gym and deployed to real robots. Experimental results demonstrate that our policy can effectively grasp large flat objects without requiring additional maneuvers. Furthermore, the policy exhibits strong generalization capabilities, successfully handling unseen objects. Importantly, it can be directly transferred to real robots without fine-tuning, consistently outperforming baseline methods.
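As a rough illustration of the PPO update with shared Actor-Critic layers mentioned in the abstract, the sketch below computes the standard clipped surrogate loss plus a value loss for one batch. It is a minimal sketch, not the authors' implementation: the function name `ppo_clipped_loss` and the values of `clip_eps` and `value_coef` are assumptions, and the entropy bonus often added in practice is left out.

```python
import math

def ppo_clipped_loss(new_logp, old_logp, advantages, values, returns,
                     clip_eps=0.2, value_coef=0.5):
    """Combined actor-critic loss for one batch, given plain lists of floats.

    clip_eps and value_coef are illustrative defaults, not values from the paper.
    """
    n = len(advantages)
    policy_terms = []
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        # Probability ratio between the updated and the data-collecting policy.
        ratio = math.exp(lp_new - lp_old)
        # Clip the ratio so a single update cannot move the policy too far.
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
        # Pessimistic (minimum) surrogate, as in the standard PPO objective.
        policy_terms.append(min(ratio * adv, clipped * adv))
    policy_loss = -sum(policy_terms) / n
    # Mean squared error between the critic's values and the observed returns.
    value_loss = sum((v - r) ** 2 for v, r in zip(values, returns)) / n
    # With shared layers, both terms are usually summed into one loss.
    return policy_loss + value_coef * value_loss
```

Because the actor and critic share layers, a single scalar like this is typically backpropagated through the common trunk, which is why the two terms are weighted and summed rather than optimized separately.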

Paper Structure

This paper contains 23 sections, 3 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: Dual-Arm Coordinated Grasp Strategy for Large Flat Objects: A demonstrates that there are no efficient grasp candidates for large flat objects such as books and keyboards. B and C show current solutions: B illustrates pushing the object against a wall to assist with grasping, while C shows pushing the object to the table edge and grasping the overhanging part. However, B's method depends on using a wall, and C is inefficient for dual-arm systems. To overcome these limitations, we propose a DRL framework to learn a cooperative dual-arm grasping strategy.
  • Figure 2: System Overview of Dual-Arm Grasping Framework: In the Isaac Gym environment, we integrate an RGB-D camera with a dual-arm UR5e robot. The camera captures RGB-D images and converts them into RGB-D top-down height maps. The RGB images are used as input observations for the framework, which employs a large-scale grasp network as the backbone to extract features. These features are fed into a CNN-based RL model trained using PPO. The output is a trained feature map, which is then decoded by a custom grasp decoder to generate executable actions within the environment. The decoder outputs two key points that determine the dual-arm grasp positions. During training, we utilize four distinctly shaped objects.
  • Figure 3: Grasp Decoder for Dual-Arm Execution: The Grasp Decoder converts the trained feature map from RGB images into actionable grasp points. First, the highest predicted pixel is selected (a); then two points are aligned along one of three axes (b). If two points cannot be found along an axis, alternatives are chosen hierarchically. The final action points (c) guide the dual-arm robot’s grasp based on depth information.
  • Figure 4: Simulation and Real-World Scenarios: In the simulation, objects are randomly placed in different positions and orientations, and the policy is trained in parallel environments within Isaac Gym. To accelerate training, a simplified version uses two grippers instead of full arms. To evaluate generalization, three object sets are tested: common large flat objects, irregularly shaped large flat objects, and household large flat objects. In the real-world scenario, the policy is validated on two UR5e robots equipped with Robotiq 2F-140 grippers.
  • Figure 5: The training objects are divided into four categories of commonly shaped items, with each object randomly assigned a color during training.
  • ...and 6 more figures
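
The decoding steps described in the Figure 3 caption can be sketched roughly as follows. This is a hypothetical illustration only: the function name `decode_grasp_points`, the three axis angles (0°, 60°, 120°), the binary `object_mask`, and the "walk outward until leaving the object" rule are all assumptions introduced here, since the caption does not specify them. The final depth lookup for the grasp height is omitted.

```python
import math

def decode_grasp_points(score_map, object_mask,
                        angles_deg=(0.0, 60.0, 120.0), max_steps=50):
    """Return (peak, point_a, point_b) as (row, col) tuples, or None.

    score_map and object_mask are 2D lists of equal size. The angles and
    the edge-walking rule are illustrative assumptions, not the paper's.
    """
    rows, cols = len(score_map), len(score_map[0])
    # (a) Select the highest-scoring pixel of the feature map.
    peak = max(((r, c) for r in range(rows) for c in range(cols)),
               key=lambda rc: score_map[rc[0]][rc[1]])

    def walk_to_edge(dr, dc):
        """Step from the peak along (dr, dc); return the last on-object pixel."""
        last = None
        for step in range(1, max_steps + 1):
            r = round(peak[0] + step * dr)
            c = round(peak[1] + step * dc)
            if not (0 <= r < rows and 0 <= c < cols) or not object_mask[r][c]:
                break
            last = (r, c)
        return last

    # (b) Try each candidate axis in order; fall back to the next axis
    # when two points cannot be found on the current one.
    for angle in angles_deg:
        dr = math.sin(math.radians(angle))
        dc = math.cos(math.radians(angle))
        point_a = walk_to_edge(dr, dc)
        point_b = walk_to_edge(-dr, -dc)
        if point_a is not None and point_b is not None:
            # (c) These two pixels would become the dual-arm action points.
            return peak, point_a, point_b
    return None
```

Trying the axes in a fixed order is one simple way to realize the "hierarchical" fallback the caption mentions; a real decoder could instead rank the axes by predicted score.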