Augmented Reality based Simulated Data (ARSim) with multi-view consistency for AV perception networks
Aqeel Anwar, Tae Eun Choe, Zian Wang, Sanja Fidler, Minwoo Park
TL;DR
ARSim introduces a fully automated, modular AR-based data augmentation framework that reduces the covariate shift between real driving data and synthetic content. By inferring non-object domain attributes from real data and applying simulation-based randomization to object attributes, ARSim renders 3D synthetic assets in a multi-view, camera-parameter-consistent manner with HDR lighting, producing realistic augmented data with ground-truth labels. Across obstacle, freespace, and parking perception tasks, ARSim yields measurable improvements over real data alone and outperforms purely VR-based synthetic augmentation, with further gains when combining ARSim and VRSim data. This approach reduces the need for hand-crafted 3D scenes and can be adapted to other multi-camera perception problems beyond autonomous driving.
Abstract
Detecting a diverse range of objects under various driving scenarios is essential for the effectiveness of autonomous driving systems. However, the real-world data collected often lacks the necessary diversity and follows a long-tail distribution. Although synthetic data has been used to overcome this issue by generating virtual scenes, it faces hurdles such as a significant domain gap and the substantial effort required from 3D artists to create realistic environments. To overcome these challenges, we present ARSim, a fully automated, comprehensive, modular framework designed to enhance real multi-view image data with 3D synthetic objects of interest. The proposed method integrates domain adaptation and randomization strategies to address covariate shift between real and simulated data by inferring essential domain attributes from real data and employing simulation-based randomization for other attributes. We construct a simplified virtual scene using real data and strategically place 3D synthetic assets within it. Illumination is obtained by estimating the light distribution from multiple images capturing the surroundings of the vehicle. Camera parameters from real data are used to render synthetic assets in each frame. The resulting augmented, multi-view consistent dataset is used to train a multi-camera perception network for autonomous vehicles. Experimental results on various AV perception tasks demonstrate the superior performance of networks trained on the augmented dataset.
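To illustrate what multi-view consistency from real camera parameters can mean in practice, the following is a minimal sketch, not the ARSim implementation. It assumes an ideal pinhole camera without lens distortion, a world frame that coincides with the front camera frame (x right, y down, z forward), and extrinsics of the form x_cam = R · x_world + t. The names `project_point`, `project_to_rig`, and `RIG`, as well as all numeric values, are hypothetical. The point is that a single 3D anchor of a synthetic asset is projected with each camera's own parameters, so the asset lands at geometrically consistent pixels in every view that sees it.

```python
from typing import Dict, Optional, Sequence, Tuple

Vec3 = Sequence[float]
Mat3 = Sequence[Sequence[float]]
Pixel = Tuple[float, float]


def mat_vec(m: Mat3, v: Vec3) -> Tuple[float, float, float]:
    """Multiply a 3x3 matrix by a 3-vector."""
    return (
        sum(m[0][j] * v[j] for j in range(3)),
        sum(m[1][j] * v[j] for j in range(3)),
        sum(m[2][j] * v[j] for j in range(3)),
    )


def project_point(point_world: Vec3, K: Mat3, R: Mat3, t: Vec3,
                  width: int, height: int) -> Optional[Pixel]:
    """Project a world point into one camera; None if it is not visible."""
    rotated = mat_vec(R, point_world)
    x_cam = (rotated[0] + t[0], rotated[1] + t[1], rotated[2] + t[2])
    if x_cam[2] <= 0.0:  # at or behind the image plane
        return None
    u_h, v_h, w_h = mat_vec(K, x_cam)  # homogeneous pixel coordinates
    u, v = u_h / w_h, v_h / w_h
    if not (0.0 <= u < width and 0.0 <= v < height):  # outside the image
        return None
    return (u, v)


def project_to_rig(point_world: Vec3,
                   cameras: Dict[str, dict]) -> Dict[str, Optional[Pixel]]:
    """Project the same 3D point with every camera's own parameters."""
    return {name: project_point(point_world, **cam)
            for name, cam in cameras.items()}


# Hypothetical two-camera rig (assumed values, for illustration only).
K_SHARED = [[1000.0, 0.0, 960.0],
            [0.0, 1000.0, 540.0],
            [0.0, 0.0, 1.0]]
RIG = {
    "front": {  # looks along world +z
        "K": K_SHARED,
        "R": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
        "t": [0.0, 0.0, 0.0], "width": 1920, "height": 1080,
    },
    "right": {  # looks along world +x
        "K": K_SHARED,
        "R": [[0.0, 0.0, -1.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0]],
        "t": [0.0, 0.0, 0.0], "width": 1920, "height": 1080,
    },
}

# An asset anchor 10 m to the right and 5 m ahead of the vehicle.
print(project_to_rig((10.0, 0.0, 5.0), RIG))
# {'front': None, 'right': (460.0, 540.0)}
```

In this toy rig the anchor falls outside the front image but inside the right one. A real pipeline would use the calibrated intrinsics, extrinsics, and distortion model of each camera, which this sketch does not attempt to model.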
