BiGraspFormer: End-to-End Bimanual Grasp Transformer

Kangmin Kim, Seunghyeok Back, Geonhyup Lee, Sangbeom Lee, Sangjun Noh, Kyoobin Lee

Abstract

Bimanual grasping is essential for robots to handle large and complex objects. However, existing methods either focus solely on single-arm grasping or employ separate grasp generation and bimanual evaluation stages, leading to coordination problems including collision risks and unbalanced force distribution. To address these limitations, we propose BiGraspFormer, a unified end-to-end transformer framework that directly generates coordinated bimanual grasps from object point clouds. Our key idea is the Single-Guided Bimanual (SGB) strategy, which first generates diverse single grasp candidates using a transformer decoder, then leverages their learned features through specialized attention mechanisms to jointly predict bimanual poses and quality scores. This conditioning strategy reduces the complexity of the 12-DoF search space while ensuring coordinated bimanual manipulation. Comprehensive simulation experiments and real-world validation demonstrate that BiGraspFormer consistently outperforms existing methods while maintaining efficient inference speed (<0.05s), confirming the effectiveness of our framework. Code and supplementary materials are available at https://sites.google.com/view/bigraspformer
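The conditioning idea described in the abstract, in which bimanual predictions draw on features learned for single-grasp candidates, can be illustrated with ordinary scaled dot-product cross-attention. The sketch below is a generic illustration only and is not the SGB attention used in BiGraspFormer. The feature vectors, the query vectors, and the names `cross_attention`, `single_feats`, and `bimanual_queries` are all hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Generic scaled dot-product cross-attention.

    Each query attends over all keys and returns a weighted average
    of the corresponding values. This is a textbook mechanism, not
    the paper's SGB attention.
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        logits = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in keys
        ]
        weights = softmax(logits)
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs


# Hypothetical 2-D features for three single-grasp candidates.
single_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Hypothetical bimanual queries that attend over the single-grasp features.
bimanual_queries = [[2.0, 0.0], [0.0, 2.0]]

# Each fused vector is a convex combination of the single-grasp features.
fused = cross_attention(bimanual_queries, single_feats, single_feats)
```

In a full model, a prediction head would map each fused vector to two 6-DoF gripper poses (12 DoF in total) and a quality score; that head is omitted here because the abstract does not specify it.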

Paper Structure

This paper contains 14 sections, 4 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: BiGraspFormer for coordinated bimanual grasping. (Top) Comparison between single and bimanual grasping for large objects. (Bottom) BiGraspFormer successfully grasps and lifts diverse objects in real-world environments, demonstrating stable grasps across various geometries. We visualize the point cloud with the top-10 grasp predictions; the highest-scoring pair is highlighted in thick blue.
  • Figure 2: Overview of our BiGraspFormer framework. An object encoder processes point cloud $P$ to extract geometric features. Single Grasp Proposer generates force-closure single grasps, Bimanual Pair Matcher matches them using bimanual quality metrics (force stability, torque balance, dexterity) to create ground truth, and Bimanual Grasp Generator employs SGB attention for final bimanual grasp generation.
  • Figure 3: Simulation experiment settings. (Left) Normal force condition, in which objects are grasped and lifted under gravity only. (Right) Disturbance force condition, in which a weighted cube is dropped onto the object during lifting to apply additional external force and test grasp robustness.
  • Figure 4: Visualization of predicted bimanual grasp poses. The top 100 bimanual grasp poses predicted by DPN-GPD, CGDF, and BiGraspFormer are shown for each object in the test set, based on simulation outcomes. The top-1 grasp pair is highlighted in blue. Green grasps represent successful grasps, while red grasps indicate failures due to instability, object collisions, or torque imbalance during grasping or lifting. Red circles highlight notable failure cases of baseline methods.
  • Figure 5: Real-World Experimental Setup. (Left) Dual-arm robotic system with two UR5e arms and two Azure Kinect RGB-D cameras, along with test objects and visualization of feasible bimanual grasps overlaid on object point clouds. (Right) Collision-free trajectory execution of selected grasp poses for grasping and lifting diverse objects.