DiffuDepGrasp: Diffusion-based Depth Noise Modeling Empowers Sim2Real Robotic Grasping
Yingting Zhou, Wenbo Cui, Weiheng Liu, Guixing Chen, Haoran Li, Dongbin Zhao
TL;DR
This work tackles the sim2real gap in depth-based robotic grasping caused by real sensor artifacts such as holes and noise. It introduces DiffuDepGrasp, a deploy-efficient zero-shot framework whose Diffusion Depth Generator synthesizes sensor-realistic noisy depth maps from pristine simulation data. The generator is guided by temporal geometric priors from a depth foundation model and refined by a Noise Grafting module that preserves geometric accuracy. A teacher policy trained with privileged information in simulation is distilled into a depth-only student policy through imitation learning, enabling deployment without online depth processing. The approach achieves a 95.7% average success rate on a 12-object grasping task and generalizes well to unseen objects, which makes it practical for data-efficient, real-time robotic grasping in unstructured environments.
Abstract
Transferring a depth-based end-to-end policy trained in simulation to physical robots can yield an efficient and robust grasping policy, yet sensor artifacts in real depth maps, such as voids and noise, create a significant sim2real gap that critically impedes policy transfer. Training-time strategies such as procedural noise injection or learned mappings suffer from data inefficiency: they either simulate noise unrealistically, which is often ineffective for grasping tasks that require fine manipulation, or depend heavily on paired datasets. Furthermore, leveraging foundation models to reduce the sim2real gap via intermediate representations fails to fully mitigate the domain shift and adds computational overhead during deployment. This work confronts the dual challenges of data inefficiency and deployment complexity. We propose DiffuDepGrasp, a deploy-efficient sim2real framework that enables zero-shot transfer through simulation-exclusive policy training. Its core innovation, the Diffusion Depth Generator, combines geometrically pristine simulation depth with learned sensor-realistic noise via two synergistic modules. The first, the Diffusion Depth Module, leverages temporal geometric priors to enable sample-efficient training of a conditional diffusion model that captures complex sensor noise distributions. The second, the Noise Grafting Module, preserves metric accuracy during perceptual artifact injection. With only raw depth inputs during deployment, DiffuDepGrasp eliminates this computational overhead and achieves a 95.7% average success rate on 12-object grasping with zero-shot transfer and strong generalization to unseen objects. Project website: https://diffudepgrasp.github.io/.
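To make the idea of grafting sensor artifacts onto metrically accurate depth concrete, the following is a minimal sketch and not the authors' implementation. It assumes that holes are encoded as zero depth, that the generated noisy depth map is pixel-aligned with the simulation depth map, and that valid-pixel noise can be bounded by a small residual. The function name `graft_noise` and the parameters `hole_value` and `max_residual` are hypothetical and introduced only for illustration.

```python
import numpy as np


def graft_noise(sim_depth, gen_depth, hole_value=0.0, max_residual=0.01):
    """Illustrative grafting rule (an assumption, not the paper's module).

    sim_depth: clean simulation depth in meters, shape (H, W).
    gen_depth: generated noisy depth in meters, same shape.
    Returns the grafted depth map and the boolean hole mask.
    """
    sim_depth = np.asarray(sim_depth, dtype=np.float32)
    gen_depth = np.asarray(gen_depth, dtype=np.float32)
    if sim_depth.shape != gen_depth.shape:
        raise ValueError("depth maps must have the same shape")

    # Pixels the generated map marks as invalid (holes).
    hole_mask = gen_depth <= hole_value

    # Keep the simulation geometry as the metric reference and allow only
    # a bounded deviation taken from the generated map.
    residual = np.clip(gen_depth - sim_depth, -max_residual, max_residual)
    grafted = sim_depth + residual

    # Transfer the holes unchanged so the policy sees realistic voids.
    grafted[hole_mask] = hole_value
    return grafted, hole_mask
```

Under these assumptions, valid pixels never drift more than `max_residual` from the simulated geometry, while holes are copied as they appear in the generated map. A training pipeline could apply such a function to each rendered frame before feeding it to the depth-only student policy.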
