NeRF-VO: Real-Time Sparse Visual Odometry with Neural Radiance Fields

Jens Naumann, Binbin Xu, Stefan Leutenegger, Xingxing Zuo

TL;DR

NeRF-VO tackles real-time monocular visual odometry by fusing a fast, learning-based sparse VO front-end with a neural implicit dense mapping back-end. It takes sparse pose and patch-depth estimates from DPVO, augments them with monocular dense depth and surface normal predictions, and optimizes a Nerfacto-based NeRF jointly with the keyframe poses. The key contributions are the sparse-to-dense scale alignment, the use of monocular depth and normal priors, and a tightly integrated, asynchronous system that achieves state-of-the-art pose accuracy, dense reconstruction, and novel view synthesis while keeping tracking latency and GPU memory usage low. The work reports strong performance across synthetic and real-world datasets, highlights its potential for robotics and AR applications built on real-time neural SLAM, and suggests future work on fusing scene constraints into the pose front-end.

Abstract

We introduce a novel monocular visual odometry (VO) system, NeRF-VO, that integrates learning-based sparse visual odometry for low-latency camera tracking and a neural radiance scene representation for fine-detailed dense reconstruction and novel view synthesis. Our system initializes camera poses using sparse visual odometry and obtains view-dependent dense geometry priors from a monocular prediction network. We harmonize the scale of poses and dense geometry, treating them as supervisory cues to train a neural implicit scene representation. NeRF-VO demonstrates exceptional performance in both photometric and geometric fidelity of the scene representation by jointly optimizing a sliding window of keyframed poses and the underlying dense geometry, which is accomplished through training the radiance field with volume rendering. We surpass SOTA methods in pose estimation accuracy, novel view synthesis fidelity, and dense reconstruction quality across a variety of synthetic and real-world datasets while achieving a higher camera tracking frequency and consuming less GPU memory.
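
As an illustration of the volume rendering step the abstract refers to, the following is a minimal sketch of the standard NeRF compositing rule along a single ray. It is not the NeRF-VO or Nerfacto implementation: the function name `render_ray` and its arguments are hypothetical, and the sketch assumes per-sample densities and colors have already been queried from the radiance field.

```python
import numpy as np

def render_ray(sigmas, colors, t_vals):
    """Composite per-sample densities and colors along one ray (hypothetical helper)."""
    sigmas = np.asarray(sigmas, dtype=float)   # (N,) volume densities
    colors = np.asarray(colors, dtype=float)   # (N, 3) RGB per sample
    t_vals = np.asarray(t_vals, dtype=float)   # (N,) sample distances along the ray

    # Spacing between adjacent samples; the last interval is treated as very large.
    deltas = np.append(np.diff(t_vals), 1e10)
    # Opacity contributed by each sample.
    alphas = 1.0 - np.exp(-sigmas * deltas)
    # Transmittance: fraction of light that reaches sample i without being absorbed.
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alphas[:-1])))
    weights = trans * alphas

    rgb = (weights[:, None] * colors).sum(axis=0)  # expected color
    depth = (weights * t_vals).sum()               # expected depth
    return rgb, depth, weights
```

Because both the rendered color and the rendered depth are differentiable functions of the densities and sample positions, photometric and depth losses on these outputs can propagate gradients to the scene representation and, through the ray origins and directions, to the camera poses.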

Paper Structure (17 sections, 10 equations, 2 figures, 9 tables)

Figures (2)

  • Figure 1: System architecture of NeRF-VO. The method uses only a sequence of RGB images as input. The sparse visual tracking module selects keyframes from this input stream and calculates camera poses and depth values for a set of sparse patches. Additionally, the dense geometry enhancement module predicts dense depth maps and surface normals and aligns them with the sparse depth from the tracking module. The NeRF-based dense mapping module utilizes raw RGB images, inferred depth maps, surface normals, and camera poses to optimize a neural implicit representation and refine the camera poses. Our system is capable of performing high-quality 3D dense reconstruction and rendering images at novel views.
  • Figure 2: 3D reconstructions of five scenes from Replica [replica19arxiv]. The pictures of GO-SLAM [goslam] and HI-SLAM [hislam] have been taken from their respective papers. Arrows highlight selected prominent artifacts and defects.
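
The alignment of dense depth predictions with sparse patch depths mentioned in the Figure 1 caption can be illustrated with a generic least-squares scale-and-shift fit. This is a sketch under stated assumptions, not the paper's exact procedure, which may differ: the function name `align_dense_to_sparse`, the integer pixel coordinates `sparse_uv`, and the affine model `a * dense + b` are all assumptions made for illustration.

```python
import numpy as np

def align_dense_to_sparse(dense_depth, sparse_uv, sparse_depth):
    """Fit scale a and shift b so that a * dense + b matches the sparse depths.

    dense_depth : (H, W) monocular depth prediction with unknown scale
    sparse_uv   : (K, 2) integer pixel coordinates (u = column, v = row)
    sparse_depth: (K,) depths from the sparse tracker at those pixels
    All names here are hypothetical, chosen for this sketch.
    """
    u, v = sparse_uv[:, 0], sparse_uv[:, 1]
    d = dense_depth[v, u]  # dense prediction sampled at the sparse locations
    # Solve min over (a, b) of || a * d + b - sparse_depth ||^2.
    A = np.stack([d, np.ones_like(d)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, sparse_depth, rcond=None)
    return a * dense_depth + b, a, b
```

A plain least-squares fit is sensitive to outliers in the sparse depths; a robust variant (for example, fitting on medians or reweighting residuals) would be a natural refinement if the sparse estimates are noisy.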