Variational Shape Inference for Grasp Diffusion on SE(3)
S. Talha Bukhari, Kaivalya Agrawal, Zachary Kingston, Aniket Bera
TL;DR
The paper tackles multimodal grasp synthesis under noisy and partial 3D observations. It first learns a robust latent shape prior with a variational autoencoder built on implicit neural representations, then uses the resulting shape features to guide diffusion-based grasp generation on SE(3). A test-time optimization plugin refines the generated grasp poses with differentiable objectives, improving stability and collision avoidance at inference. On the ACRONYM dataset, the method outperforms state-of-the-art multimodal grasp synthesis methods by 6.3% and remains robust as point clouds become sparser. The trained model also transfers zero-shot from simulation to real household objects, generating 34% more successful grasps than baselines. The work combines geometry-aware priors with diffusion models and provides open-source code for reproducibility and further research.
Abstract
Grasp synthesis is a fundamental task in robotic manipulation that usually admits multiple feasible solutions. Multimodal grasp synthesis seeks to generate diverse sets of stable grasps conditioned on object geometry, which makes robust learning of geometric features crucial for success. To address this challenge, we propose a framework for learning multimodal grasp distributions that leverages variational shape inference to enhance robustness against shape noise and measurement sparsity. Our approach first trains a variational autoencoder for shape inference using implicit neural representations, and then uses the learned geometric features to guide a diffusion model for grasp synthesis on the SE(3) manifold. Additionally, we introduce a test-time grasp optimization technique that can be integrated as a plugin to further enhance grasping performance. Experimental results demonstrate that our shape-inference formulation for grasp synthesis outperforms state-of-the-art multimodal grasp synthesis methods on the ACRONYM dataset by 6.3%, while remaining more robust than other approaches to reductions in point cloud density. Furthermore, our trained model achieves zero-shot transfer to real-world manipulation of household objects, generating 34% more successful grasps than baselines despite measurement noise and point cloud calibration errors.
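To make "diffusion on the SE(3) manifold" concrete, the following is a minimal NumPy sketch of one common way to perturb a grasp pose with Gaussian noise: sample a twist in the Lie algebra and apply it through the exponential map. The abstract does not specify the noise model, so this construction, the function names `hat`, `se3_exp`, and `perturb_grasp`, and the noise scale `sigma` are all illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def hat(w):
    """Map a 3-vector to its 3x3 skew-symmetric matrix."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def se3_exp(xi):
    """Exponential map from a twist xi = (v, w) in R^6 to a 4x4 pose matrix."""
    v, w = xi[:3], xi[3:]
    theta = np.linalg.norm(w)
    W = hat(w)
    if theta < 1e-8:
        # Small-angle case: truncated Taylor series avoids division by ~0.
        R = np.eye(3) + W + 0.5 * (W @ W)
        V = np.eye(3) + 0.5 * W + (W @ W) / 6.0
    else:
        A = np.sin(theta) / theta
        B = (1.0 - np.cos(theta)) / theta**2
        C = (theta - np.sin(theta)) / theta**3
        R = np.eye(3) + A * W + B * (W @ W)   # Rodrigues' formula
        V = np.eye(3) + B * W + C * (W @ W)   # couples rotation and translation
    H = np.eye(4)
    H[:3, :3] = R
    H[:3, 3] = V @ v
    return H

def perturb_grasp(H, sigma, rng):
    """Assumed noise model: right-multiply pose H by Exp(sigma * eps), eps ~ N(0, I_6)."""
    eps = rng.standard_normal(6)
    return H @ se3_exp(sigma * eps)

# Usage: perturb an identity grasp pose with a hypothetical noise scale.
rng = np.random.default_rng(0)
H_noisy = perturb_grasp(np.eye(4), sigma=0.1, rng=rng)
```

Because the noise is applied through the exponential map, the perturbed pose remains a valid rigid transform (orthonormal rotation block, unit determinant), which adding Gaussian noise directly to the matrix entries would not guarantee.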
