DiffuDepGrasp: Diffusion-based Depth Noise Modeling Empowers Sim2Real Robotic Grasping

Yingting Zhou, Wenbo Cui, Weiheng Liu, Guixing Chen, Haoran Li, Dongbin Zhao

TL;DR

This work tackles the sim2real gap in depth-based robotic grasping caused by real sensor artifacts such as holes and noise. It introduces DiffuDepGrasp, a deploy-efficient zero-shot framework that uses a Diffusion Depth Generator to synthesize photorealistic noisy depth maps from pristine simulation data, guided by temporal geometric priors from a depth foundation model and refined by a Noise Grafting module that preserves geometric accuracy. A teacher policy trained with privileged information in simulation is distilled into a depth-only student policy through imitation learning, enabling deployment without online depth processing. The approach achieves a 95.7% average success rate on a 12-object grasping task and generalizes well to unseen objects, demonstrating strong practical impact for data-efficient, real-time robotic grasping in unstructured environments.

Abstract

Transferring a depth-based end-to-end policy trained in simulation to physical robots can yield an efficient and robust grasping policy, yet sensor artifacts in real depth maps, such as voids and noise, create a significant sim2real gap that critically impedes policy transfer. Training-time strategies such as procedural noise injection or learned mappings are data-inefficient: procedural noise is unrealistic and often ineffective for grasping tasks that require fine manipulation, while learned mappings depend heavily on paired datasets. Furthermore, leveraging foundation models to reduce the sim2real gap via intermediate representations fails to fully mitigate the domain shift and adds computational overhead during deployment. This work confronts the dual challenges of data inefficiency and deployment complexity. We propose DiffuDepGrasp, a deploy-efficient sim2real framework enabling zero-shot transfer through simulation-exclusive policy training. Its core innovation, the Diffusion Depth Generator, combines geometrically pristine simulation depth with learned, sensor-realistic noise via two synergistic modules. The first, the Diffusion Depth Module, leverages temporal geometric priors to enable sample-efficient training of a conditional diffusion model that captures complex sensor noise distributions, while the second, the Noise Grafting Module, preserves metric accuracy during perceptual artifact injection. With only raw depth inputs during deployment, DiffuDepGrasp eliminates computational overhead and achieves a 95.7% average success rate on 12-object grasping with zero-shot transfer and strong generalization to unseen objects. Project website: https://diffudepgrasp.github.io/.
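
The grafting idea mentioned in the abstract (inject sensor artifacts while keeping the simulated geometry metrically exact) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes artifacts show up as invalid, zero-valued pixels in the generated depth and copies only that hole mask onto the simulation depth. The function name `graft_holes` and the zero-as-invalid convention are hypothetical.

```python
import numpy as np

def graft_holes(sim_depth, generated_depth, invalid_value=0.0):
    """Copy the hole pattern of `generated_depth` onto `sim_depth`.

    Assumption (not from the paper): holes are encoded as `invalid_value`.
    Valid pixels keep the simulated metric depth unchanged.
    """
    sim_depth = np.asarray(sim_depth, dtype=np.float32)
    generated_depth = np.asarray(generated_depth, dtype=np.float32)
    if sim_depth.shape != generated_depth.shape:
        raise ValueError("depth maps must have the same shape")
    # Pixels the generator marked as invalid become holes in the output.
    hole_mask = generated_depth == invalid_value
    return np.where(hole_mask, np.float32(invalid_value), sim_depth)

# Toy 2x2 depth maps in metres (made-up values for illustration only).
sim = np.array([[0.50, 0.52], [0.55, 0.60]], dtype=np.float32)
gen = np.array([[0.47, 0.00], [0.58, 0.00]], dtype=np.float32)
grafted = graft_holes(sim, gen)
```

In this toy case the right-hand column becomes holes, while the left-hand column keeps the simulated values rather than the generator's slightly different estimates.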

Paper Structure

This paper contains 26 sections, 10 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: DiffuDepGrasp Framework. In the (A) Teacher Policy Training stage, we leverage privileged state information in simulation to train a high-performance, RL-based teacher policy for collecting expert demonstrations. The (B) Diffusion Depth Generator (DDG) stage consists of two core modules. The first, the Diffusion Depth Module, is trained on RGB-D data collected in the real world to learn the sensor's noise distribution. Note: $k$ denotes the diffusion process timestep, distinct from the policy timestep $t$ in Sec. \ref{sec:rl}. The second, the Noise Grafting Module, is designed to inject these learned artifacts into pristine simulation geometry. During inference, the complete DDG algorithm transforms simulated RGB-D data into high-fidelity, noisy depth maps. In the (C) Student Policy Distillation stage, we collect expert trajectories, convert their visual data into generated noisy depth using the Diffusion Depth Generator (B), and then distill the teacher's knowledge into a student policy via imitation learning. Finally, this student policy achieves zero-shot (D) Sim2Real Deployment, transferring directly to a physical robot to perform grasping tasks.
  • Figure 2: Comparison of Visual Representations for Sim2Real. (a) Simulated RGB and (f) real-world RGB. (b) Clean ground-truth (GT) depth from simulation. (g) Raw, noisy depth from the real sensor. The baseline inputs are: (c) GT depth with procedural random noise (Rand Noise), (h) inpainted real depth (Inpaint), and (d), (i) depth estimated by Depth Anything V2 (DAv2) from simulated and real RGB images. For comparison, (e) and (j) show the final, high-fidelity depth maps generated by our proposed DDG algorithm from the simulation and real-world data, respectively.
  • Figure 3: Qualitative Results of our Noisy Depth Data Generation. From top to bottom, the rows show: (1) the original simulated RGB image; (2) the corresponding pristine, clean depth in simulation; (3) the depth maps generated by the Diffusion Depth Module without the Noise Grafting Module (DDG w/o G); and (4) the depth maps generated by the Diffusion Depth Module with the Noise Grafting Module (DDG).
  • Figure 4: t-SNE Visualization. Each subplot visualizes the feature distribution of real-world depth data (blue points) against depth data produced by the simulation-based methods (orange points). We define the terms as follows: Real Raw: Raw depth from the physical sensor. Sim GT: Clean ground-truth depth from simulation. Sim Rand Noise: Sim GT depth with procedural noise. Real Inpaint: Real Raw depth after applying an inpainting algorithm. Sim/Real DAv2: Depth estimated by Depth Anything V2 from sim/real RGB. Sim DDG w/o G: Depth generated by the Diffusion Depth Module without Noise Grafting. Sim DDG: Depth generated by the Diffusion Depth Module with Noise Grafting. The specific pairs compared are: (a) Real Raw vs. Sim GT; (b) Real Raw vs. Sim Rand Noise; (c) Real Inpaint vs. Sim GT; (d) Real DAv2 vs. Sim DAv2; (e) Real Raw vs. Sim DDG w/o G; (f) Real Raw vs. Sim DDG.
  • Figure 5: The objects used to train policies in simulation and in the real world.
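
The "Rand Noise" baseline in Figures 2 and 4 adds procedural noise to clean simulation depth. The captions do not specify the noise model, so the sketch below assumes a common form: additive Gaussian noise plus randomly dropped pixels set to zero. The function name `add_procedural_noise` and the parameters `sigma` and `dropout_prob`, including their default values, are hypothetical.

```python
import numpy as np

def add_procedural_noise(depth, sigma=0.005, dropout_prob=0.05, seed=0):
    """Return a noisy copy of a clean depth map (values in metres).

    Assumed noise model (not taken from the paper):
      * additive zero-mean Gaussian noise with standard deviation `sigma`
      * each pixel independently becomes a hole (0.0) with `dropout_prob`
    """
    rng = np.random.default_rng(seed)
    depth = np.asarray(depth, dtype=np.float32)
    jitter = rng.normal(0.0, sigma, size=depth.shape).astype(np.float32)
    noisy = depth + jitter
    # Independent per-pixel dropout; real sensor holes are spatially structured.
    holes = rng.random(depth.shape) < dropout_prob
    noisy[holes] = 0.0
    return noisy

clean = np.full((4, 4), 0.5, dtype=np.float32)
noisy = add_procedural_noise(clean)
```

Because the holes here are independent per pixel, they lack the structure of real sensor voids, such as those near object edges in the raw real depth of Figure 2(g). This is the kind of mismatch the learned noise model is meant to close.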