DAGDiff: Guiding Dual-Arm Grasp Diffusion to Stable and Collision-Free Grasps

Md Faizal Karim, Vignesh Vembar, Keshab Patra, Gaurav Singh, K Madhava Krishna

TL;DR

DAGDiff addresses reliable dual-arm grasping by generating grasp pairs directly in $SE(3) \times SE(3)$ using a diffusion process guided by force-closure and collision signals. It introduces an energy-based model $E_\alpha$ with score $s_\alpha$ and uses dual-arm mappings via $\operatorname{Logmap_2}$ and $\operatorname{Expmap_2}$ to enable end-to-end learning, augmented by two classifier-guidance modules for stability and collision avoidance. Training optimizes $\mathcal{L}_{\text{diff}}$, $\mathcal{L}_{\text{fc}}$, and $\mathcal{L}_{\text{col}}$, while inference incorporates gradient guidance and a collision-refinement threshold $t_c$ to produce low-energy, physically valid dual-arm grasps. Experiments on DG16M and Isaac Gym demonstrate substantial improvements over baselines, and zero-shot real-world tests on unseen objects confirm practical transfer to real sensor data with heterogeneous dual-arm hardware.
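One way to read the role of the two mappings, as a sketch under assumptions rather than the paper's stated definitions: suppose $\operatorname{Logmap_2}$ applies the standard $SE(3)$ logarithm to each arm's pose and stacks the two 6-D tangent vectors, and $\operatorname{Expmap_2}$ inverts it. The classifier symbols $C^{\text{fc}}$ and $C^{\text{col}}$, the weights $\lambda_{\text{fc}}$ and $\lambda_{\text{col}}$, the step size $\eta_t$, the relation $s_\alpha = -\nabla E_\alpha$, and the exact form of the update are all assumptions made for illustration; gradients are taken with respect to the tangent coordinates $\operatorname{Logmap_2}(H)$.

$$
\begin{aligned}
\operatorname{Logmap_2}(H_1, H_2) &= \bigl[\operatorname{Logmap}(H_1);\ \operatorname{Logmap}(H_2)\bigr] \in \mathbb{R}^{12}, \\
\operatorname{Expmap_2}(h_1, h_2) &= \bigl(\operatorname{Expmap}(h_1),\ \operatorname{Expmap}(h_2)\bigr) \in SE(3) \times SE(3), \\
g_t &= s_\alpha(H, t) + \lambda_{\text{fc}}\, \nabla \log C^{\text{fc}}(H, t) + \lambda_{\text{col}}\, \nabla \log\bigl(1 - C^{\text{col}}(H, t)\bigr), \\
H &\leftarrow \operatorname{Expmap_2}\bigl(\operatorname{Logmap_2}(H) + \eta_t\, g_t + \sqrt{2\eta_t}\,\epsilon\bigr), \qquad \epsilon \sim \mathcal{N}(0, I_{12}).
\end{aligned}
$$

Under this reading, the energy term pulls samples toward low-energy grasp pairs, while the two log-probability gradients raise the predicted chance of force closure and lower the predicted chance of collision.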

Abstract

Reliable dual-arm grasping is essential for manipulating large and complex objects but remains a challenging problem due to stability, collision, and generalization requirements. Prior methods typically decompose the task into two independent grasp proposals, relying on region priors or heuristics that limit generalization and provide no principled guarantee of stability. We propose DAGDiff, an end-to-end framework that denoises directly to grasp pairs in the $SE(3) \times SE(3)$ space. Our key insight is that stability and collision avoidance can be enforced more effectively by guiding the diffusion process with classifier signals, rather than relying on explicit region detection or object priors. To this end, DAGDiff integrates geometry-, stability-, and collision-aware guidance terms that steer the generative process toward grasps that are physically valid and force-closure compliant. We comprehensively evaluate DAGDiff through analytical force-closure checks, collision analysis, and large-scale physics-based simulations, showing consistent improvements over previous work on these metrics. Finally, we demonstrate that our framework generates dual-arm grasps directly on real-world point clouds of previously unseen objects, which are executed on a heterogeneous dual-arm setup where two manipulators reliably grasp and lift them.

Paper Structure

This paper contains 12 sections, 9 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: We introduce DAGDiff: Dual-Arm Grasp Diffusion, a diffusion framework in $SE(3) \times SE(3)$ that takes an object point cloud and denoises noisy grasp pairs (in shades of red) into stable, collision-free dual-arm grasps (in shades of green), guided by multi-head outputs. These predicted grasps are further validated through real-world dual-arm executions, where objects are grasped and lifted successfully.
  • Figure 2: Overview of the proposed method: (a) Given an object point cloud $P$, our network encodes geometric features into dense feature maps. Next, randomly initialized dual-arm grasps $H$ are used to transform a fixed query cloud into query points, followed by feature sampling through bilinear interpolation. Conditioned on the noise step $t$, these features are passed through $F_\theta$, which predicts both the SDF of the query points and a feature vector. This vector is used by three output heads that predict energy $(E_\alpha)$, force-closure probability $(C_\beta^{\text{fc}})$, and collision probability $(C_\gamma^{\text{col}})$, jointly guiding the diffusion process. (b) At inference, denoising proceeds from random initializations ($t=250$) to refined grasps ($t=0$). The energy head drives the generative dynamics, while the force-closure and collision heads bias the generation until stable, collision-free dual-arm grasps emerge.
  • Figure 3: Qualitative comparison of dual-arm grasps. VCGS often fails due to poor grasp generation and its farthest-region heuristic. CGDF shares the latter limitation; for example, on the bucket, one gripper cannot reach the bottom corner. UniDiffGrasp relies on GPT-4V and VLPart for semantic segmentation, but when region labels are ambiguous (e.g., last two rows), it defaults to naive splits, yielding unstable grasps. RoboBrainGrasp predicts keypoints or bounding boxes, yet the keypoints are frequently misaligned (e.g., inside the stool) and the boxes are too coarse for precise grasping. In contrast, our method directly generates stable, collision-free dual-arm grasps by reasoning over physical constraints, without heuristics or vague semantic cues. (Red circles indicate grasps that either collide or fail to contact the object. RoboBrainGrasp-KP and -BB are referred to jointly as RoboBrainGrasp.)
  • Figure 4: Real-world experimental setup. Dual-arm system with an XArm7 and XArm6 Lite, calibrated using an AprilTag and observed by two RealSense D455 cameras.
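
The inference procedure in Figure 2(b), where an energy term drives denoising from $t=250$ to $t=0$ while two classifier heads bias it, can be illustrated on a toy problem. The following is a minimal sketch and not the paper's implementation: the 2-D vector `h` stands in for a grasp pair that would really live in $SE(3) \times SE(3)$, and the functions `energy`, `log_p_force_closure`, and `log_p_collision_free`, the step size, the noise schedule, and the guidance weights `w_fc` and `w_col` are all hypothetical stand-ins for the learned heads and their hyperparameters.

```python
import math
import random


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def energy(h):
    """Toy stand-in for the energy head: quadratic bowl with minimum at (1, 0)."""
    return 0.5 * ((h[0] - 1.0) ** 2 + h[1] ** 2)


def log_p_force_closure(h):
    """Toy stand-in for log C^fc: favours the half-plane x > 0."""
    return log_sigmoid(4.0 * h[0])


def log_p_collision_free(h):
    """Toy stand-in for log(1 - C^col): penalises a disc of radius 0.5 at the origin."""
    r = math.hypot(h[0], h[1])
    return log_sigmoid((r - 0.5) / 0.1)


def grad(f, h, eps=1e-5):
    """Central finite-difference gradient of a scalar function f at h."""
    g = []
    for i in range(len(h)):
        hp, hm = list(h), list(h)
        hp[i] += eps
        hm[i] -= eps
        g.append((f(hp) - f(hm)) / (2.0 * eps))
    return g


def guided_denoise(h, steps=250, step_size=0.05, w_fc=1.0, w_col=1.0, rng=random):
    """Annealed Langevin-style updates: descend the energy, ascend both log-probabilities."""
    for t in range(steps, 0, -1):
        noise_scale = 0.3 * t / steps  # assumed schedule: noise shrinks as t approaches 0
        g_e = grad(energy, h)
        g_fc = grad(log_p_force_closure, h)
        g_col = grad(log_p_collision_free, h)
        h = [
            x
            - step_size * ge                          # energy term drives the dynamics
            + step_size * (w_fc * gf + w_col * gc)    # classifier guidance biases them
            + math.sqrt(2.0 * step_size) * noise_scale * rng.gauss(0.0, 1.0)
            for x, ge, gf, gc in zip(h, g_e, g_fc, g_col)
        ]
    return h


rng = random.Random(0)
h0 = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]  # random initialisation at t = 250
h_final = guided_denoise(h0, rng=rng)
print(h_final, energy(h_final))
```

In this toy setting the sample settles near the energy minimum while staying outside the penalised disc. Setting `w_fc` or `w_col` to zero removes the corresponding bias, which is a simple way to see what each guidance term contributes.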