Variational Shape Inference for Grasp Diffusion on SE(3)

S. Talha Bukhari, Kaivalya Agrawal, Zachary Kingston, Aniket Bera

TL;DR

The paper tackles multimodal grasp synthesis under noisy and partial 3D observations. It first learns a robust latent shape representation with a variational autoencoder built on implicit neural representations, then uses these shape features to guide diffusion-based grasp generation on SE(3). A test-time optimization plugin refines the generated grasp poses with differentiable objectives, improving stability and collision avoidance during inference. On the ACRONYM dataset the method outperforms state-of-the-art multimodal grasp synthesis methods by 6.3% and stays robust as point cloud density deteriorates. In zero-shot sim-to-real transfer on household objects it generates 34% more successful grasps than baselines. The work shows practical gains for robotic manipulation from combining geometry-aware shape priors with diffusion models, and offers open-source code for reproducibility and further research.

Abstract

Grasp synthesis is a fundamental task in robotic manipulation which usually has multiple feasible solutions. Multimodal grasp synthesis seeks to generate diverse sets of stable grasps conditioned on object geometry, making the robust learning of geometric features crucial for success. To address this challenge, we propose a framework for learning multimodal grasp distributions that leverages variational shape inference to enhance robustness against shape noise and measurement sparsity. Our approach first trains a variational autoencoder for shape inference using implicit neural representations, and then uses these learned geometric features to guide a diffusion model for grasp synthesis on the SE(3) manifold. Additionally, we introduce a test-time grasp optimization technique that can be integrated as a plugin to further enhance grasping performance. Experimental results demonstrate that our shape inference for grasp synthesis formulation outperforms state-of-the-art multimodal grasp synthesis methods on the ACRONYM dataset by 6.3%, while demonstrating robustness to deterioration in point cloud density compared to other approaches. Furthermore, our trained model achieves zero-shot transfer to real-world manipulation of household objects, generating 34% more successful grasps than baselines despite measurement noise and point cloud calibration errors.
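
To make "diffusion on the SE(3) manifold" concrete, the following is a minimal sketch of how a grasp pose can be perturbed with Gaussian noise drawn in the Lie algebra and mapped back to a valid pose through the exponential map. This is a common construction for score-based models on SE(3), not necessarily the exact noising scheme used in the paper. The function names (`hat`, `se3_exp`, `perturb_pose`), the twist ordering (translation first, rotation second), and the noise scale are assumptions made for illustration.

```python
import numpy as np

def hat(w):
    """Map a 3-vector to its 3x3 skew-symmetric matrix."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def se3_exp(xi):
    """Exponential map from a twist xi = (v, w) in R^6 to a 4x4 pose."""
    v, w = xi[:3], xi[3:]
    theta = np.linalg.norm(w)
    K = hat(w)
    if theta < 1e-8:
        # First-order approximation near the identity.
        R = np.eye(3) + K
        V = np.eye(3) + 0.5 * K
    else:
        a = np.sin(theta) / theta
        b = (1.0 - np.cos(theta)) / theta**2
        c = (theta - np.sin(theta)) / theta**3
        R = np.eye(3) + a * K + b * (K @ K)   # Rodrigues' formula
        V = np.eye(3) + b * K + c * (K @ K)   # couples rotation and translation
    H = np.eye(4)
    H[:3, :3] = R
    H[:3, 3] = V @ v
    return H

def perturb_pose(H, sigma, rng):
    """Right-multiply H by the exponential of Gaussian noise (assumed scheme)."""
    xi = sigma * rng.standard_normal(6)
    return H @ se3_exp(xi)

rng = np.random.default_rng(0)
H0 = np.eye(4)                       # hypothetical clean grasp pose
H_noisy = perturb_pose(H0, 0.5, rng)
R_noisy = H_noisy[:3, :3]
```

Because the noise is applied through the exponential map, `H_noisy` remains a rigid transform: its rotation block stays orthonormal with unit determinant, which would not hold if noise were added directly to the matrix entries.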

Paper Structure

This paper contains 18 sections, 6 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 2: Pipeline of our proposed approach. We first learn variational shape encodings by training a VAE-style architecture that reconstructs planar shape features for conditioning SDF MLP queries (first row). The learned shape encodings then condition the grasp diffuser by querying point cloud representations of gripper poses. DSM denotes the Denoising Score Matching objective and $\mathrm{L}_1$ denotes the $\mathrm{L}_1$ norm penalty. $PC_o$ and $PC_g$ denote object and gripper-attached point clouds, respectively. Blue and red arrows indicate training procedures for the shape inference and grasp diffusion stages, respectively.
  • Figure 3: Objective functions for test-time pose optimization: grasp pinch center alignment $\mathcal{L}_c$ (left) and neural SDF of the gripper $\mathcal{L}_{\Omega_G}$ (right). These differentiable objectives guide generated grasps toward more stable configurations during inference.
  • Figure 4: Distribution of performance metrics. Our method demonstrates the most consistent grasp performance across test objects while maintaining competitive grasp diversity.
  • Figure 5: Performance evaluation on partial point clouds. Our method demonstrates the most consistent grasp performance while maintaining diversity across test objects despite using single-view measurements only.
  • Figure 6: Qualitative comparison of generated grasps for Laptop (top row), Donut (middle row), and Pencil (bottom row). Dark cyan indicates successful grasps while dark purple indicates failures. Compared to baselines, our method generates more successful and stable grasps across diverse object geometries.
  • ...and 2 more figures
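
The test-time pose optimization described in the Figure 3 caption combines an alignment objective with a signed-distance objective and refines a grasp by following their gradients. The toy sketch below illustrates that idea only, under strong simplifying assumptions that are not from the paper: the grasp is reduced to a translation `t`, the gripper body is a sphere with an analytic SDF rather than a neural SDF, the alignment target is the object centroid, and gradients come from finite differences. All names and constants are hypothetical.

```python
import numpy as np

def sphere_sdf(p, radius):
    """Signed distance to a sphere at the origin (negative inside)."""
    return np.linalg.norm(p, axis=-1) - radius

def refinement_loss(t, obj_pts, target, body_offset, radius, lam):
    """Toy objective: pinch-center alignment plus a collision penalty."""
    # Alignment term: pull the pinch center (here, t itself) to the target.
    align = np.sum((t - target) ** 2)
    # Collision term: penalize object points inside the gripper body,
    # modeled as a sphere centered at t + body_offset (assumption).
    depth = np.maximum(0.0, -sphere_sdf(obj_pts - (t + body_offset), radius))
    return align + lam * np.sum(depth ** 2)

def numerical_grad(f, x, eps=1e-5):
    """Central finite-difference gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x)
        d[i] = eps
        g[i] = (f(x + d) - f(x - d)) / (2.0 * eps)
    return g

def refine(t0, f, lr=0.1, steps=200):
    """Plain gradient descent on the refinement objective."""
    t = t0.copy()
    for _ in range(steps):
        t = t - lr * numerical_grad(f, t)
    return t

rng = np.random.default_rng(0)
obj_pts = rng.uniform(-0.02, 0.02, size=(50, 3))   # hypothetical object cloud
target = obj_pts.mean(axis=0)                      # assumed alignment target
body_offset = np.array([0.0, 0.0, -0.08])          # body sits behind the pinch center
f = lambda t: refinement_loss(t, obj_pts, target, body_offset, 0.03, 0.05)

t0 = np.array([0.0, 0.0, 0.07])                    # initial grasp: body overlaps object
t_refined = refine(t0, f)
loss0, loss1 = f(t0), f(t_refined)
print(loss0, loss1)
```

In a full pipeline the update would act on the whole SE(3) pose and use automatic differentiation through the learned SDF; the sketch keeps only the structure of the objective, an attraction term plus a penetration penalty.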