Neural Rendering based Urban Scene Reconstruction for Autonomous Driving

Shihao Shen, Louis Kerofsky, Varun Ravi Kumar, Senthil Yogamani

TL;DR

This paper tackles dense, accurate reconstruction of urban scenes for autonomous driving by combining neural implicit surfaces with radiance fields to produce dense geometry and renderings from multimodal sensor data. It introduces a foreground-background decomposition, a dynamic-object filtering strategy based on 3D detections, and a divide-and-conquer training scheme that scales to large environments, all supervised by photometric, Eikonal, and LiDAR-derived geometry losses. The reported results show that adding LiDAR improves reconstruction quality (PSNR rises from 26.211 to 31.993 and RMSE falls from 9.808 to 5.243 in the Figure 3 setup) and that dynamic object filtering reduces artifacts. Overall, the method yields scalable, high-fidelity neural scene representations suited to annotation validation, data augmentation, and auto-labeling pipelines in autonomous driving.

Abstract

Dense 3D reconstruction has many applications in automated driving, including automated annotation validation, multimodal data augmentation, providing ground truth annotations for systems lacking LiDAR, and enhancing auto-labeling accuracy. LiDAR provides highly accurate but sparse depth, whereas camera images enable dense depth estimation that is noisy, particularly at long ranges. In this paper, we harness the strengths of both sensors and propose a multimodal 3D scene reconstruction framework that combines neural implicit surfaces and radiance fields. In particular, our method estimates dense and accurate 3D structures and creates an implicit map representation based on signed distance fields, which can be further rendered into RGB images and depth maps. A mesh can be extracted from the learned signed distance field and culled based on occlusion. Dynamic objects are efficiently filtered on the fly during sampling using 3D object detection models. We demonstrate qualitative and quantitative results on challenging automotive scenes.
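
The on-the-fly filtering of dynamic objects during sampling can be illustrated with a minimal sketch. It assumes each 3D detection is available as a (center, size, yaw) box and that ray samples are 3D points in the same frame; the function names and the box format are hypothetical and are not taken from the paper.

```python
import numpy as np

def points_in_box(points, center, size, yaw):
    """Boolean mask of the points that fall inside one yaw-rotated 3D box.

    Assumed (hypothetical) box format: center (x, y, z),
    size (length, width, height), yaw about the z axis in radians.
    """
    # Express the points in the box frame: translate, then rotate by -yaw.
    local = points - np.asarray(center)
    c, s = np.cos(-yaw), np.sin(-yaw)
    x = c * local[:, 0] - s * local[:, 1]
    y = s * local[:, 0] + c * local[:, 1]
    z = local[:, 2]
    half = np.asarray(size) / 2.0
    return (np.abs(x) <= half[0]) & (np.abs(y) <= half[1]) & (np.abs(z) <= half[2])

def filter_dynamic_samples(points, boxes):
    """Drop every sample that lies inside any detected box."""
    keep = np.ones(len(points), dtype=bool)
    for center, size, yaw in boxes:
        keep &= ~points_in_box(points, center, size, yaw)
    return points[keep], keep

# Three samples and one detected vehicle-sized box centred at x = 10 m.
samples = np.array([[10.0, 0.0, 0.0],   # inside the box
                    [0.0, 0.0, 0.0],    # far from the box
                    [10.0, 1.5, 0.0]])  # just beyond the box's 2 m width
boxes = [((10.0, 0.0, 0.0), (4.0, 2.0, 2.0), 0.0)]
kept, keep_mask = filter_dynamic_samples(samples, boxes)
```

Masking samples this way means the discarded points never contribute to any loss, which is one plausible reading of "filtered on the fly during sampling"; the paper's actual mechanism may differ.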

Paper Structure (9 sections, 5 equations, 5 figures)

Figures (5)

  • Figure 1: We demonstrate dense and accurate 3D structure being represented by an implicit model, which can be further rendered into RGB images and depth maps or reconstructed into a high-quality mesh.
  • Figure 2: Overview of our foreground model. It follows the same multiresolution hash grid design as Instant-NGP (Müller et al., 2022) but predicts signed distance rather than density. A second MLP head takes the viewing direction as input and predicts view-dependent color. Three losses supervise training: a photometric loss on the reconstructed scene appearance, an Eikonal loss that regularizes the learned SDF, and a geometry loss on the reconstructed scene geometry.
  • Figure 3: Qualitative and quantitative results that demonstrate benefits of combining LiDAR and camera. PSNR$\uparrow$: $26.211$ without LiDAR and $31.993$ with LiDAR. RMSE$\downarrow$: $9.808$ without LiDAR and $5.243$ with LiDAR.
  • Figure 4: Qualitative results of dynamic object filtering.
  • Figure 5: Large-scale support demonstration in BEV. Occlusion culling is not applied to the mesh for simplicity. Green boxes denote the spatial extent allocated to each subsequence. Top: Extracted mesh of the first subsequence. Bottom: Extracted mesh of the next subsequence, merged into that of the first subsequence.
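
The three supervision terms named in the Figure 2 caption can be sketched in NumPy. This is an illustration under stated assumptions, not the paper's formulation: the L1 forms of the photometric and geometry terms, the finite-difference gradient, and all function names are hypothetical choices. Only the Eikonal term has a standard form, which penalizes the deviation of the SDF gradient norm from 1.

```python
import numpy as np

def sdf_gradient(sdf, points, eps=1e-4):
    """Central finite-difference gradient of an SDF at each 3D point."""
    grads = np.zeros_like(points)
    for axis in range(3):
        offset = np.zeros(3)
        offset[axis] = eps
        grads[:, axis] = (sdf(points + offset) - sdf(points - offset)) / (2 * eps)
    return grads

def eikonal_loss(sdf, points):
    """Mean squared deviation of the gradient norm from 1 (valid SDF => 0)."""
    norms = np.linalg.norm(sdf_gradient(sdf, points), axis=1)
    return np.mean((norms - 1.0) ** 2)

def photometric_loss(rendered_rgb, target_rgb):
    """Assumed L1 color error between rendered and observed pixels."""
    return np.mean(np.abs(rendered_rgb - target_rgb))

def geometry_loss(rendered_depth, lidar_depth, has_lidar):
    """Assumed L1 depth error, evaluated only on rays with a LiDAR return."""
    if not np.any(has_lidar):
        return 0.0
    return np.mean(np.abs(rendered_depth[has_lidar] - lidar_depth[has_lidar]))

def sphere_sdf(points, radius=1.0):
    """Exact signed distance to a sphere centred at the origin."""
    return np.linalg.norm(points, axis=1) - radius

def scaled_sdf(points):
    """Not a valid SDF: its gradient norm is 2 everywhere."""
    return 2.0 * sphere_sdf(points)

# Sample points away from the origin, where the sphere SDF is smooth.
rng = np.random.default_rng(0)
pts = rng.uniform(0.5, 2.0, size=(256, 3))

loss_valid = eikonal_loss(sphere_sdf, pts)    # close to 0
loss_invalid = eikonal_loss(scaled_sdf, pts)  # close to (2 - 1)^2 = 1
```

In a real training loop the SDF gradient would come from automatic differentiation rather than finite differences, and the three terms would be combined with weights that are not given in this summary.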