Augmented Reality based Simulated Data (ARSim) with multi-view consistency for AV perception networks

Aqeel Anwar, Tae Eun Choe, Zian Wang, Sanja Fidler, Minwoo Park

TL;DR

ARSim introduces a fully automated, modular AR-based data augmentation framework to bridge the covariate gap between real driving data and synthetic content. By inferring non-object domain attributes from real data and applying simulation-based randomization to object attributes, ARSim renders 3D synthetic assets in a multi-view, camera-parameter-consistent manner with HDR lighting, producing realistic augmented data with ground-truth labels. Across obstacle, freespace, and parking perception tasks, ARSim yields measurable improvements over real data alone and outperforms purely VR-based synthetic augmentation, with further gains when combining ARSim and VRSim data. This approach reduces the need for hand-crafted 3D scenes and can be adapted to other multi-camera perception problems beyond autonomous driving.

Abstract

Detecting a diverse range of objects under various driving scenarios is essential for the effectiveness of autonomous driving systems. However, collected real-world data often lacks the necessary diversity and follows a long-tail distribution. Although synthetic data has been used to overcome this issue by generating virtual scenes, it faces hurdles such as a significant domain gap and the substantial effort required from 3D artists to create realistic environments. To overcome these challenges, we present ARSim, a fully automated, comprehensive, modular framework designed to enhance real multi-view image data with 3D synthetic objects of interest. The proposed method integrates domain adaptation and randomization strategies to address covariate shift between real and simulated data by inferring essential domain attributes from real data and employing simulation-based randomization for other attributes. We construct a simplified virtual scene using real data and strategically place 3D synthetic assets within it. Illumination is achieved by estimating the light distribution from multiple images capturing the surroundings of the vehicle. Camera parameters from real data are employed to render synthetic assets in each frame. The resulting augmented, multi-view consistent dataset is used to train a multi-camera perception network for autonomous vehicles. Experimental results on various AV perception tasks demonstrate the superior performance of networks trained on the augmented dataset.
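
The multi-view consistency described in the abstract can be illustrated with a minimal sketch. It is not the ARSim implementation: it assumes an ideal pinhole camera without lens distortion, an ego frame with x forward, y left, and z up, and cameras that differ only by a yaw angle and a mounting position. The names `Camera`, `ego_to_camera`, `project`, and `project_to_rig` are hypothetical, as are all numeric values. The point it shows is that one 3D asset position, projected with each camera's own parameters, yields pixel locations that agree across views by construction.

```python
import math
from dataclasses import dataclass


@dataclass
class Camera:
    """Hypothetical pinhole camera mounted on the ego vehicle."""
    name: str
    fx: float          # focal length in pixels (horizontal)
    fy: float          # focal length in pixels (vertical)
    cx: float          # principal point, x
    cy: float          # principal point, y
    width: int         # image width in pixels
    height: int        # image height in pixels
    yaw: float         # heading of the optical axis about ego z, radians
    position: tuple    # camera centre in the ego frame (x, y, z), metres


def ego_to_camera(cam, point):
    """Map a point from the ego frame (x fwd, y left, z up)
    to the camera frame (x right, y down, z along the optical axis)."""
    dx = point[0] - cam.position[0]
    dy = point[1] - cam.position[1]
    dz = point[2] - cam.position[2]
    c, s = math.cos(cam.yaw), math.sin(cam.yaw)
    forward = c * dx + s * dy    # component along the optical axis
    left = -s * dx + c * dy      # component to the camera's left
    return (-left, -dz, forward)


def project(cam, point, near=0.1):
    """Return the pixel (u, v) of a 3D ego-frame point, or None if the
    point is behind the near plane or outside the image."""
    x, y, z = ego_to_camera(cam, point)
    if z <= near:
        return None
    u = cam.fx * x / z + cam.cx
    v = cam.fy * y / z + cam.cy
    if 0 <= u < cam.width and 0 <= v < cam.height:
        return (u, v)
    return None


def project_to_rig(cameras, point):
    """Project one 3D point into every camera of the rig."""
    return {cam.name: project(cam, point) for cam in cameras}


# Illustrative rig: a front and a left camera sharing one mounting point.
rig = [
    Camera("front", 500.0, 500.0, 960.0, 540.0, 1920, 1080,
           yaw=0.0, position=(0.0, 0.0, 1.5)),
    Camera("left", 500.0, 500.0, 960.0, 540.0, 1920, 1080,
           yaw=math.pi / 2, position=(0.0, 0.0, 1.5)),
]

# A synthetic asset anchor point ahead and to the left of the ego car.
pixels = project_to_rig(rig, (10.0, 10.0, 1.5))
print(pixels)
```

Because both cameras in this toy rig see the same point, it appears left of centre in the front view and right of centre in the left view. A production pipeline would additionally need lens distortion, full 3D rotations, and occlusion handling, none of which are modeled here.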

Paper Structure (20 sections, 2 equations, 15 figures, 7 tables)

This paper contains 20 sections, 2 equations, 15 figures, 7 tables.

Figures (15)

  • Figure 1: An overview of the proposed approach and its impact. (left) We generate an HDR light map from the input data and position assets of interest in 3D around the ego car, subsequently rendering them within the camera frame, handling collision and occlusion with existing real objects. Concurrently, the pipeline achieves multi-view consistent frame rendering (right top). Additionally, integrating ARSim data with real data enhances performance metrics across three crucial AV perception tasks: obstacle detection, freespace detection, and parking detection (right bottom), as demonstrated in detail in the results section.
  • Figure 2: (Left) High-level block diagram of ARSim. (Right) Example multi-view consistent data generated by ARSim, with ground-truth generation and modification.
  • Figure 3: ARSim VRU data for improving obstacle detection
  • Figure 4: Impact of ARSim data on freespace detection
  • Figure 5: Improvement in person/biker class performance metrics using ARSim
  • ...and 10 more figures