Spatial RoboGrasp: Generalized Robotic Grasping Control Policy
Yiqi Huang, Travis Davies, Jiahuan Yan, Jiankai Sun, Xiang Chen, Luhui Hu
TL;DR
This work tackles the robustness and generalization gaps in robotic grasping by integrating a spatially grounded perception stack with a diffusion-based policy. Spatial RoboGrasp combines AugFusion domain-randomized RGB augmentations, monocular depth estimation, and 6-DoF grasp prompts into a rich observation embedding that conditions a stochastic, contact-aware controller. Ablations and task-based evaluations on PickBig, PickCup, and PickGoods show substantial improvements in task and grasp success under environmental variation, with depth, augmentation, and grasp prompts each making a distinct contribution. Because it avoids heavy 3D sensing, the approach is a candidate for scalable real-world deployment while delivering reliable, goal-directed manipulation in unstructured settings.
Abstract
Achieving generalizable and precise robotic manipulation across diverse environments remains a critical challenge, largely due to limitations in spatial perception. While prior imitation-learning approaches have made progress, their reliance on raw RGB inputs and handcrafted features often leads to overfitting and poor 3D reasoning under varied lighting, occlusion, and object conditions. In this paper, we propose a unified framework that couples robust multimodal perception with reliable grasp prediction. Our architecture fuses domain-randomized augmentation, monocular depth estimation, and a depth-aware 6-DoF Grasp Prompt into a single spatial representation for downstream action planning. Conditioned on this encoding and a high-level task prompt, our diffusion-based policy yields precise action sequences, achieving up to a 40% improvement in grasp success and 45% higher task success rates under environmental variation. These results demonstrate that spatially grounded perception, paired with diffusion-based imitation learning, offers a scalable and robust solution for general-purpose robotic grasping.
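To make the data flow described above concrete, the following is a minimal sketch, not the authors' implementation. It assumes the per-modality features are fused by plain concatenation and that actions are sampled with a standard DDPM-style reverse process under a linear noise schedule; neither detail is stated in this section. All names (`fuse_observation`, `sample_actions`, `make_toy_eps_model`), the feature sizes, and the 6-dimensional action are hypothetical. The toy noise predictor is an analytic stand-in for a trained network: it simply pulls every action in the horizon toward the grasp pose stored at the end of the conditioning vector.

```python
import math
import random


def fuse_observation(rgb_feat, depth_feat, grasp_prompt):
    """Fuse per-modality features into one conditioning vector.

    Assumption: fusion is a plain concatenation, grasp prompt last.
    """
    return list(rgb_feat) + list(depth_feat) + list(grasp_prompt)


def linear_betas(steps, beta_start=1e-4, beta_end=0.02):
    """Linear noise schedule (a common DDPM default, assumed here)."""
    if steps == 1:
        return [beta_start]
    span = beta_end - beta_start
    return [beta_start + span * i / (steps - 1) for i in range(steps)]


def make_toy_eps_model(action_dim):
    """Hypothetical stand-in for a learned noise predictor.

    It returns the exact noise for a target trajectory that repeats the
    grasp pose (the last `action_dim` entries of `cond`) at every step.
    """
    def eps_model(x, t, cond, alpha_bar_t):
        grasp_pose = cond[-action_dim:]
        s = math.sqrt(alpha_bar_t)
        d = math.sqrt(1.0 - alpha_bar_t)
        return [(xi - s * grasp_pose[i % action_dim]) / d
                for i, xi in enumerate(x)]
    return eps_model


def sample_actions(eps_model, cond, horizon, action_dim, steps=50, seed=0):
    """DDPM-style reverse process: start from noise, denoise step by step."""
    rng = random.Random(seed)
    betas = linear_betas(steps)
    alphas = [1.0 - b for b in betas]
    alpha_bars, prod = [], 1.0
    for a in alphas:
        prod *= a
        alpha_bars.append(prod)

    # Flattened action sequence, initialised as Gaussian noise.
    x = [rng.gauss(0.0, 1.0) for _ in range(horizon * action_dim)]
    for t in reversed(range(steps)):
        eps = eps_model(x, t, cond, alpha_bars[t])
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = [(xi - coef * ei) / math.sqrt(alphas[t])
                for xi, ei in zip(x, eps)]
        if t > 0:
            # Inject fresh noise on every step except the last one.
            sigma = math.sqrt(betas[t])
            x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
        else:
            x = mean
    # Reshape the flat vector into `horizon` actions of size `action_dim`.
    return [x[i * action_dim:(i + 1) * action_dim] for i in range(horizon)]


# Illustrative values only: 2 RGB features, 1 depth feature, 6-DoF grasp pose.
cond = fuse_observation([0.1, 0.2], [0.3], [0.5, -0.2, 0.3, 0.0, 1.0, 0.0])
actions = sample_actions(make_toy_eps_model(6), cond, horizon=4, action_dim=6)
```

In a real system the toy predictor would be replaced by a trained network that takes the fused embedding and the task prompt as conditioning; the sampling loop itself would stay structurally the same.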
