NeRF-SLAM: Real-Time Dense Monocular SLAM with Neural Radiance Fields

Antoni Rosinol, John J. Leonard, Luca Carlone

TL;DR

This work tackles real-time monocular 3D reconstruction by fusing dense monocular SLAM with a probabilistic, hash-based neural radiance field (NeRF). It takes the poses, dense depths, and per-pixel depth/pose covariances estimated by dense SLAM and uses them to weight depth supervision in a real-time NeRF training pipeline, achieving superior geometric and photometric accuracy without requiring depth or pose as input. The method reports state-of-the-art results on Replica compared to TSDF-based methods and recent NeRF-based SLAM approaches, while maintaining real-time performance. Its main limitation is high GPU memory usage, for which the authors propose mitigations; future directions include metric-semantic SLAM and dynamic scene understanding.

Abstract

We propose a novel geometric and photometric 3D mapping pipeline for accurate and real-time scene reconstruction from monocular images. To achieve this, we leverage recent advances in dense monocular SLAM and real-time hierarchical volumetric neural radiance fields. Our insight is that dense monocular SLAM provides the right information to fit a neural radiance field of the scene in real-time, by providing accurate pose estimates and depth-maps with associated uncertainty. With our proposed uncertainty-based depth loss, we achieve not only good photometric accuracy, but also great geometric accuracy. In fact, our proposed pipeline achieves better geometric and photometric accuracy than competing approaches (up to 179% better PSNR and 86% better L1 depth), while working in real-time and using only monocular images.
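
The uncertainty-based depth loss mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a diagonal depth covariance, so each pixel's squared depth residual is simply scaled by the inverse of its variance. The function names, the `eps` stabilizer, and the balance weight `lambda_d` are all hypothetical and chosen only for illustration.

```python
import numpy as np


def weighted_depth_loss(depth_rendered, depth_slam, sigma_d, eps=1e-6):
    """Mean squared depth residual, each pixel weighted by 1 / sigma_d**2.

    Assumption: the depth covariance is diagonal, so the weighting reduces
    to a per-pixel inverse variance. `eps` (hypothetical) avoids division
    by zero for pixels with zero reported uncertainty.
    """
    residual = depth_rendered - depth_slam
    weights = 1.0 / (sigma_d ** 2 + eps)
    return float(np.mean(weights * residual ** 2))


def mapping_loss(rgb_rendered, rgb_observed,
                 depth_rendered, depth_slam, sigma_d, lambda_d=1.0):
    """Photometric MSE plus the weighted depth term.

    `lambda_d` is a hypothetical balance weight, not a value from the paper.
    """
    photometric = float(np.mean((rgb_rendered - rgb_observed) ** 2))
    geometric = weighted_depth_loss(depth_rendered, depth_slam, sigma_d)
    return photometric + lambda_d * geometric


# Two pixels with the same depth error but different uncertainty.
rendered = np.array([1.0, 2.0])
slam = np.array([1.5, 2.5])
sigma = np.array([0.5, 5.0])
print(weighted_depth_loss(rendered, slam, sigma))
```

In this toy example both pixels have the same 0.5 depth error, but the confident pixel (small `sigma_d`) contributes roughly a hundred times more to the loss than the uncertain one, which is the intended effect of weighting depth supervision by its uncertainty.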

Paper Structure (18 sections, 7 equations, 6 figures, 1 table)

Figures (6)

  • Figure 1: From left to right: input RGB image; estimated depth uncertainty; depth-maps back-projected into a pointcloud, after thresholding the depth by its uncertainty ($\sigma_d \leq 1.0$) for visualization; and our resulting neural radiance field, rendered from the same viewpoint as the input image. Our pipeline is capable of reconstructing neural radiance fields in real-time given only a stream of RGB images.
  • Figure 2: The input to our pipeline consists of sequential monocular images (here represented as Img 1 & Img 2). Starting from the top-right, our architecture fits a NeRF using Instant-NGP [muller2022instant], which we supervise using RGB images $\mathbf{I}$ and depths $\mathbf{D}$, where the depths are weighted by their marginal covariance, $\mathbf{\Sigma_D}$. Inspired by Rosinol et al. [Rosinol22wacv], we compute these covariances from dense monocular SLAM. In our case, we use Droid-SLAM [teed2021droid]. We provide more details about the flow of information in the tracking section. In blue, we show Droid-SLAM's [teed2021droid] contributions and flow of information; in pink, the contributions of Rosinol et al. [Rosinol22wacv]; and in red, our contribution.
  • Figure 3: Qualitative results on the Replica office-0 dataset using different mapping approaches. From top to bottom, raw pointcloud from our tracking module, TSDF reconstruction using $\sigma$-Fusion, Nice-SLAM's results, and ours.
  • Figure 4: Impact on performance of using depth supervision with and without ground-truth depth, and of initializing the poses with ground-truth or noisy poses, compared with our approach, which estimates dense depths and poses. Results after $60$ s of convergence.
  • Figure 5: (Top-Left) Raw pointcloud estimated by the tracking module. (Bottom-Left) Pointcloud after thresholding the depth uncertainty ($\sigma_d \leq 1.0$) for visualization. (Right Column) Radiance field reconstructions after $120$ s of optimization, with and without depth weighting (top-right and bottom-right respectively). Room scene in the Cube-Diorama dataset [abou2022implicit].
  • ...and 1 more figure
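
The pointcloud visualization described in the captions of Figures 1 and 5 (back-projecting a depth-map and keeping only pixels with $\sigma_d \leq 1.0$) can be sketched as follows. This is an illustrative sketch, not the authors' code: it assumes a pinhole camera whose intrinsics `fx`, `fy`, `cx`, `cy` are hypothetical parameters not given in the text above, and it returns points in the camera frame only.

```python
import numpy as np


def backproject_confident_depth(depth, sigma_d, fx, fy, cx, cy, sigma_max=1.0):
    """Back-project a depth-map into camera-frame 3D points.

    Only pixels with positive depth and uncertainty sigma_d <= sigma_max
    are kept. A pinhole camera model is assumed; the intrinsics are
    hypothetical inputs supplied by the caller.
    """
    h, w = depth.shape
    # u indexes columns (x direction), v indexes rows (y direction).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    keep = (sigma_d <= sigma_max) & (depth > 0)
    z = depth[keep]
    x = (u[keep] - cx) * z / fx
    y = (v[keep] - cy) * z / fy
    return np.stack([x, y, z], axis=-1)  # shape (num_kept_pixels, 3)


# Toy 2x2 depth-map: only the left column passes the uncertainty threshold.
depth = np.full((2, 2), 2.0)
sigma = np.array([[0.5, 2.0],
                  [1.0, 3.0]])
points = backproject_confident_depth(depth, sigma, fx=1.0, fy=1.0, cx=0.0, cy=0.0)
print(points.shape)
```

Transforming the resulting points into a common world frame would additionally require the estimated camera pose for each frame, which this sketch leaves out.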