S-NeRF: Neural Radiance Fields for Street Views

Ziyang Xie, Junge Zhang, Wenye Li, Feihu Zhang, Li Zhang

TL;DR

S-NeRF extends Neural Radiance Fields to street-scale data by jointly modeling large-scale backgrounds and foreground moving vehicles. It introduces a robust pipeline with improved scene parameterization, pose refinement for static background, a virtual-camera transform for moving objects, and depth supervision via noisy, sparse LiDAR with a learnable confidence mechanism that fuses reprojection and geometry cues. Depth completion with NLSPN and a confidence-weighted depth loss enable reliable training from imperfect LiDAR data, while a comprehensive ablation study confirms the contribution of each component. Across nuScenes and Waymo, S-NeRF surpasses strong baselines (Mip-NeRF, Mip-NeRF360, Urban-NeRF, GeoSim) in both static street views and moving-vehicle rendering, demonstrating practical potential for driving simulations and AR/VR applications. Some depth artifacts remain (e.g., reflective windows), with future work targeting city-scale representations via block merging.

Abstract

Neural Radiance Fields (NeRFs) aim to synthesize novel views of objects and scenes, given object-centric camera views with large overlaps. However, we conjecture that this paradigm does not fit the nature of street views, which are collected by many self-driving cars from large-scale unbounded scenes. Also, the onboard cameras perceive scenes with little overlap. Thus, existing NeRFs often produce blurs, 'floaters' and other artifacts in street-view synthesis. In this paper, we propose a new street-view NeRF (S-NeRF) that jointly considers novel view synthesis of both the large-scale background scenes and the foreground moving vehicles. Specifically, we improve the scene parameterization function and the camera poses to learn better neural representations from street views. We also use the noisy and sparse LiDAR points to boost the training, and learn a robust geometry- and reprojection-based confidence to address the depth outliers. Moreover, we extend our S-NeRF to reconstruct moving vehicles, which is impracticable for conventional NeRFs. Thorough experiments on large-scale driving datasets (e.g., nuScenes and Waymo) demonstrate that our method beats the state-of-the-art rivals, reducing the mean-squared error by 7% to 40% in street-view synthesis and achieving a 45% PSNR gain in moving-vehicle rendering.
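
To put the reported mean-squared-error reduction on the more familiar decibel scale, the short sketch below uses the standard definition PSNR = 10 log10(MAX^2 / MSE), under which lowering the MSE by a fraction r raises PSNR by -10 log10(1 - r) dB. It assumes PSNR is computed from the same MSE that is being reduced (averaging per-image PSNR values can give a different number), and the function name psnr_gain_db is illustrative, not taken from the paper.

```python
import math


def psnr_gain_db(mse_reduction: float) -> float:
    """PSNR increase in dB when the MSE drops by the fraction `mse_reduction`.

    Follows from PSNR = 10 * log10(MAX^2 / MSE): scaling the MSE by (1 - r)
    adds -10 * log10(1 - r) dB, independent of MAX.
    """
    if not 0.0 <= mse_reduction < 1.0:
        raise ValueError("mse_reduction must be in [0, 1)")
    return -10.0 * math.log10(1.0 - mse_reduction)


# The two ends of the MSE-reduction range quoted in the abstract.
for r in (0.07, 0.40):
    print(f"{r:.0%} lower MSE -> +{psnr_gain_db(r):.2f} dB PSNR")
```

Under these assumptions, the 7% to 40% range corresponds to roughly 0.3 dB to 2.2 dB of PSNR.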

Paper Structure (41 sections, 8 equations, 16 figures, 18 tables)

This paper contains 41 sections, 8 equations, 16 figures, 18 tables.

Figures (16)

  • Figure 1: Problem illustration. (a) Conventional NeRFs (NeRF, Mip-NeRF) require object-centric camera views with large overlaps. (b) In challenging large-scale outdoor driving scenes (nuScenes, Waymo), the cameras used for data collection are usually arranged in a panoramic-view setting. Rays from different cameras barely intersect in the unbounded scenes, and the overlapping field of view between adjacent cameras is too small to train existing NeRF models effectively.
  • Figure 2: Performance illustration of novel-view rendering on a challenging nuScenes scene. (a) The state-of-the-art method Mip-NeRF360 produces poor results, with blurred texture details and many depth errors. (b) Our S-NeRF achieves accurate depth maps and fine texture details with fewer artifacts. (d) Our method can also reconstruct moving vehicles, which is impossible for previous NeRFs, and it synthesizes better novel views than the mesh-based method GeoSim.
  • Figure 3: Depth supervision and rendering.
  • Figure 4: Illustration of our camera transformation process for moving vehicles. During data collection, both the ego car (camera) and the target car (object) are moving. The virtual camera system treats the target car (moving object) as static and then computes the relative camera poses for the ego car's camera. These relative camera poses can be estimated using 3D object detectors. After the transformation, only the camera is moving, which is favorable for training NeRFs.
  • Figure 5: Novel-view synthesis results for static foreground vehicles. Results are reconstructed from 4 to 7 views. Our method outperforms the others (GeoSim, NeRF), with more texture details and more accurate shapes.
  • ...and 11 more figures
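
The camera transformation described in the Figure 4 caption can be sketched as a change of reference frame: if the target car is treated as static, each ego camera pose is re-expressed relative to the car's pose at the same timestamp. The example below is a minimal sketch, not the authors' implementation. It assumes 4x4 homogeneous camera-to-world and object-to-world matrices (the object poses would come from a 3D object detector, as the caption notes), and the names make_pose and virtual_camera_pose are hypothetical.

```python
import numpy as np


def make_pose(yaw, translation):
    """Build a 4x4 rigid transform: rotation about z by `yaw` (radians) plus a translation."""
    c, s = np.cos(yaw), np.sin(yaw)
    pose = np.eye(4)
    pose[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    pose[:3, 3] = translation
    return pose


def virtual_camera_pose(cam_to_world, obj_to_world):
    """Camera pose expressed in the object's frame (camera-to-object transform)."""
    return np.linalg.inv(obj_to_world) @ cam_to_world


# Two timestamps: both the ego camera and the target car move in the world frame.
cam_poses = [make_pose(0.0, [0.0, 0.0, 1.5]), make_pose(0.0, [2.0, 0.0, 1.5])]
obj_poses = [make_pose(0.1, [10.0, 1.0, 0.0]), make_pose(0.2, [13.0, 1.5, 0.0])]

# In the virtual system the car is fixed at the origin and only the camera moves.
virtual_poses = [virtual_camera_pose(c, o) for c, o in zip(cam_poses, obj_poses)]

# Sanity check: a point fixed on the car lands at the same camera-frame
# coordinates whether we go through the world frame or the virtual camera.
point_on_car = np.array([1.0, 0.5, 0.3, 1.0])  # homogeneous, object frame
for c, o, v in zip(cam_poses, obj_poses, virtual_poses):
    via_world = np.linalg.inv(c) @ (o @ point_on_car)
    via_virtual = np.linalg.inv(v) @ point_on_car
    assert np.allclose(via_world, via_virtual)
```

The check at the end shows why the transformation is lossless for rendering the car: the camera sees the vehicle identically in both systems, but in the virtual one the object no longer moves between frames.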