SiLVR: Scalable Lidar-Visual Radiance Field Reconstruction with Uncertainty Quantification

Yifu Tao, Maurice Fallon

TL;DR

SiLVR is a scalable lidar-visual NeRF framework for large-scale 3D reconstruction that combines depth and surface-normal cues from lidar with multi-view imagery. By introducing a perturbation field and applying the Laplace approximation, it yields an explicit epistemic uncertainty map ($\boldsymbol{H}^{-1}$) that quantifies sensor contributions and filters artefacts, especially at submap boundaries. The system uses depth-KL and normal regularisation, sky segmentation, and visibility-based submapping, complemented by COLMAP-based pose refinement, to deliver geometrically accurate maps with photorealistic textures across more than $20{,}000~\text{m}^2$ of real-world data. This uncertainty-aware fusion supports more reliable navigation, view planning, and mapping in robotics applications where textureless or occluded regions pose challenges.
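
For reference, the Laplace approximation mentioned above has the following generic textbook form. The symbols $\boldsymbol{\theta}$ (the parameters being perturbed), $\boldsymbol{\theta}^{*}$ (their optimised value), $\mathcal{D}$ (the camera and lidar observations) and $\mathcal{L}$ (the negative log-posterior) are notation assumed here for illustration and may differ from the paper's own:

$$
p(\boldsymbol{\theta} \mid \mathcal{D}) \approx \mathcal{N}\!\left(\boldsymbol{\theta}^{*},\ \boldsymbol{H}^{-1}\right),
\qquad
\boldsymbol{H} = \left.\nabla_{\boldsymbol{\theta}}^{2}\, \mathcal{L}(\boldsymbol{\theta})\right|_{\boldsymbol{\theta} = \boldsymbol{\theta}^{*}}
$$

Under this reading, a large entry of $\boldsymbol{H}^{-1}$ marks a location that the observations constrain only weakly, which is why it can serve as an uncertainty map.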

Abstract

We present a neural radiance field (NeRF) based large-scale reconstruction system that fuses lidar and vision data to generate high-quality reconstructions that are geometrically accurate and capture photorealistic texture. Our system adopts the state-of-the-art NeRF representation to incorporate lidar. Lidar data adds strong geometric constraints on depth and surface normals, which is particularly useful when modelling uniformly textured surfaces that contain ambiguous visual reconstruction cues. A key contribution of this work is a novel method to quantify the epistemic uncertainty of the lidar-visual NeRF reconstruction by estimating the spatial variance of each point location in the radiance field given the sensor observations from the cameras and lidar. This provides a principled approach to evaluate the contribution of each sensor modality to the final reconstruction. In this way, reconstructions that are uncertain (due to e.g. uniform visual texture, limited observation viewpoints, or little lidar coverage) can be identified and removed. Our system is integrated with a real-time lidar SLAM system which is used to bootstrap a Structure-from-Motion (SfM) reconstruction procedure. It also helps to properly constrain the overall metric scale, which is essential for the lidar depth loss. The refined SLAM trajectory can then be divided into submaps using Spectral Clustering to group sets of co-visible images together. This submapping approach is more suitable for visual reconstruction than distance-based partitioning. Our uncertainty estimation is particularly effective when merging submaps, as their boundaries often contain artefacts due to limited observations. We demonstrate the reconstruction system using a multi-camera, lidar sensor suite in experiments involving both robot-mounted and handheld scanning. Our test datasets cover a total area of more than 20,000 square metres.
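
The visibility-based submapping described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the affinity between two images is the number of 3D points they both observe, and it simplifies Spectral Clustering to a two-way split using the sign of the second eigenvector of the normalised graph Laplacian. The function names `covisibility_matrix` and `spectral_bisect` and the input format are hypothetical.

```python
import numpy as np


def covisibility_matrix(visible_points: list[set[int]]) -> np.ndarray:
    """Affinity W[i, j] = number of 3D point IDs seen by both image i and image j.

    Assumption: co-visibility is measured by shared point IDs (e.g. from SfM).
    """
    n = len(visible_points)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            W[i, j] = W[j, i] = len(visible_points[i] & visible_points[j])
    return W


def spectral_bisect(W: np.ndarray) -> np.ndarray:
    """Split images into two submaps (labels 0/1) by normalised spectral bisection."""
    degree = W.sum(axis=1)
    # Guard against isolated images with zero degree.
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(degree, 1e-12))
    # Symmetric normalised Laplacian: I - D^{-1/2} W D^{-1/2}.
    laplacian = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order; column 1 is the Fiedler vector.
    _, eigvecs = np.linalg.eigh(laplacian)
    return (eigvecs[:, 1] > 0).astype(int)


# Toy example: images 0-2 look at one area, images 3-5 at another,
# with a single shared point (ID 10) linking the two groups.
visible = [
    set(range(0, 10)), set(range(0, 10)), set(range(0, 11)),
    set(range(10, 20)), set(range(10, 20)), set(range(10, 20)),
]
labels = spectral_bisect(covisibility_matrix(visible))
```

In this toy case the two groups of images end up with different labels even though which group receives label 0 is arbitrary. Splitting into more than two submaps would require the full algorithm (several eigenvectors followed by k-means), which this sketch omits.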

Paper Structure

This paper contains 41 sections, 18 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Two large-scale reconstructions generated by SiLVR. Rendered RGB and surface normal images from the reconstructions are shown on each side. SiLVR combines visual and lidar information to create geometrically accurate maps with photorealistic textures, while considering sensor uncertainty. SiLVR uses submaps to scale to large-scale building complexes.
  • Figure 2: System overview: SiLVR builds large-scale reconstructions using images and lidar data, and a pose trajectory estimated by a separate odometry system. The sensor streams are provided by the Frontier, our custom perception payload carrying three fisheye colour cameras, an IMU, and a 3D lidar. When collecting the data, we used VILENS (Wisth et al., 2023) to estimate the trajectory of the sensor, which is refined in post-processing using COLMAP (Schönberger and Frahm, 2016) and partitioned into submaps. The camera image, lidar depth, and a derived normal image are used to train a NeRF to achieve a final 3D reconstruction. After training the NeRF, SiLVR estimates the epistemic uncertainty of the radiance field. Finally, the point cloud reconstruction is extracted from the NeRF by rendering a depth for each of the training rays. The point cloud is then filtered using per-point uncertainty estimates to remove unreliable reconstructions.
  • Figure 3: Comparison of surface normal renderings of the Maths Institute. Incorporating normal constraints in addition to depth from lidar improves the smoothness of the reconstruction. Right: The smooth reconstruction of the ground portion highlights this improvement.
  • Figure 4: Sample data from our diverse robotic datasets. Here, each image is overlaid with a projected lidar point cloud to demonstrate the accuracy of the sensor calibration.
  • Figure 5: Comparison of reconstruction quality of VILENS-SLAM, Nerfacto (vision-only) and our approach in small-scale scenes. Reconstructions are coloured by the point-to-point distance between the respective reconstruction and the ground truth scan, with error increasing from blue (0 m) to red (1 m). The trajectory is shown in purple and overlaid on the ground truth scan captured using a Leica BLK360. The zoomed-in views show the difference in reconstruction quality. Overall, our approach is more complete than the lidar-only reconstruction and geometrically more consistent than the vision-only reconstruction.
  • ...and 7 more figures
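
The last step in the Figure 2 caption, filtering the extracted point cloud by per-point uncertainty, can be sketched as follows. This is a minimal illustration, assuming each point carries a single scalar uncertainty and that a fixed threshold decides what is kept; the function name `filter_by_uncertainty`, the parameter `max_uncertainty` and the example values are hypothetical and not taken from the paper.

```python
import numpy as np


def filter_by_uncertainty(points, uncertainty, max_uncertainty):
    """Keep only points whose uncertainty does not exceed max_uncertainty.

    points:       (N, 3) array-like of XYZ coordinates.
    uncertainty:  (N,) array-like of per-point scalar uncertainty (assumed form).
    Returns the kept points and the boolean mask that selected them.
    """
    points = np.asarray(points, dtype=float)
    uncertainty = np.asarray(uncertainty, dtype=float)
    if points.shape[0] != uncertainty.shape[0]:
        raise ValueError("points and uncertainty must have the same length")
    keep = uncertainty <= max_uncertainty
    return points[keep], keep


# Toy example: four points, two of which are too uncertain to keep.
cloud = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
sigma = [0.1, 0.9, 0.2, 0.5]
kept, mask = filter_by_uncertainty(cloud, sigma, max_uncertainty=0.3)
```

Returning the mask alongside the points makes it straightforward to apply the same selection to colours or normals stored in parallel arrays.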