RaSim: A Range-aware High-fidelity RGB-D Data Simulation Pipeline for Real-world Applications

Xingyu Liu, Chenyangguang Zhang, Gu Wang, Ruida Zhang, Xiangyang Ji

TL;DR

RaSim tackles the depth-domain sim-to-real gap by simulating RealSense D400-style stereo depth sensors and introducing a range-aware rendering strategy that renders binocular IR images at near range and RGB images at far range. It builds a large-scale, photorealistic synthetic RGB-D dataset and trains SDRNet to restore ground-truth depth from simulated depth, while also pre-training the depth branches of Transformer backbones to boost real-world tasks. On depth completion (ClearGrasp) and depth-based pose estimation (YCB-V), models trained solely on RaSim achieve competitive or superior performance without finetuning, demonstrating strong cross-domain transfer. The work highlights the practical impact of depth-focused synthetic data for real-world RGB-D perception and points toward extending RaSim to additional sensors and applications.

Abstract

In robotic vision, a de facto paradigm is to learn in simulated environments and then transfer to real-world applications, which poses an essential challenge in bridging the sim-to-real domain gap. While mainstream works tackle this problem in the RGB domain, we focus on depth data synthesis and develop a range-aware RGB-D data simulation pipeline (RaSim). In particular, high-fidelity depth data is generated by imitating the imaging principle of real-world sensors. A range-aware rendering strategy is further introduced to enrich data diversity. Extensive experiments show that models trained with RaSim can be directly applied to real-world scenarios without any finetuning and excel at downstream RGB-D perception tasks.

Paper Structure

This paper contains 15 sections, 6 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Illustration of the core idea. We first generate high-fidelity simulated depth maps by imitating the imaging principle of the stereo camera, and further design a range-aware rendering strategy that renders binocular IR or RGB images according to distance to enrich data diversity. Then an SDRNet is devised to restore the ground-truth depth from simulated depth.
  • Figure 2: The pipeline of RaSim. Given the virtual scene constructed by objects, background, and global illumination, the left and right cameras take videos under chronological physical simulation. Subsequently, the simulated depth maps are generated by the semi-global stereo-matching algorithm from binocular images.
  • Figure 3: The architecture of SDRNet.
  • Figure 4: The architecture for object pose estimation.
  • Figure 5: Visualization results of depth restoration on YCB-V.
  • ...and 1 more figure
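
The captions for Figures 1 and 2 describe two mechanisms: depth recovered from binocular images by stereo matching, and a choice between IR and RGB rendering according to distance. The following is a minimal sketch of both, not the authors' implementation. It assumes the matcher has already produced a disparity map and applies the standard pinhole stereo relation depth = focal length × baseline / disparity. The function names, the 0.0 hole value, and the 1.0 m switch distance are hypothetical and chosen only for illustration; the source does not state the threshold RaSim uses.

```python
def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Triangulate metric depth from a stereo disparity map.

    disparity_px: list of rows of disparities in pixels.
    focal_px:     focal length in pixels (assumed equal for both cameras).
    baseline_m:   distance between the two camera centers in meters.
    Pixels with no valid match (disparity <= 0) become 0.0, an assumed
    convention for marking holes in the simulated depth map.
    """
    depth_m = []
    for row in disparity_px:
        depth_m.append(
            [focal_px * baseline_m / d if d > 0 else 0.0 for d in row]
        )
    return depth_m


def choose_modality(scene_distance_m, switch_distance_m=1.0):
    """Pick which binocular pair to render for stereo matching.

    Near scenes use IR images, far scenes use RGB images.
    switch_distance_m is a hypothetical threshold, not a value from the paper.
    """
    return "ir" if scene_distance_m < switch_distance_m else "rgb"
```

For example, with a 600-pixel focal length and a 0.05 m baseline, a 30-pixel disparity triangulates to a depth of about 1 m, while unmatched pixels stay as holes that a restoration network such as SDRNet would then need to fill.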