NeRF-VINS: A Real-time Neural Radiance Field Map-based Visual-Inertial Navigation System

Saimouli Katragadda, Woosik Lee, Yuxiang Peng, Patrick Geneva, Chuchu Chen, Chao Guo, Mingyang Li, Guoquan Huang

TL;DR

The paper addresses drift in map-based localization by introducing NeRF-VINS, a real-time, tightly-coupled visual–inertial system that fuses an a priori NeRF map with IMU and monocular imagery through an MSCKF-based filter. It leverages NeRF-rendered novel views to obtain informative measurements, enabling drift-free, centimeter-level pose estimates at over 10 Hz on edge hardware like the Jetson AGX Orin. The approach includes offline NeRF map generation, careful descriptor selection (favoring SuperPoint), and a rendering pipeline that balances speed and fidelity via half-resolution renders and FSRCNN upsampling. Experiments on the AR Table dataset show NeRF-VINS outperforms traditional map-based methods and many baselines in accuracy and robustness, while maintaining real-time performance and resilience to environmental changes.

Abstract

Achieving efficient and consistent localization within a prior map remains challenging in robotics. Conventional keyframe-based approaches often suffer from sub-optimal viewpoints due to limited field of view (FOV) and/or constrained motion, thus degrading the localization performance. To address this issue, we design a real-time tightly-coupled Neural Radiance Fields (NeRF)-aided visual-inertial navigation system (VINS). In particular, by effectively leveraging the NeRF's potential to synthesize novel views, the proposed NeRF-VINS overcomes the limitations of traditional keyframe-based maps (with limited views) and optimally fuses IMU, monocular images, and synthetically rendered images within an efficient filter-based framework. This tightly-coupled fusion enables efficient 3D motion tracking with bounded errors. We extensively compare the proposed NeRF-VINS against state-of-the-art methods that use prior map information and demonstrate its ability to perform real-time localization, at over 10 Hz, on a resource-constrained Jetson AGX Orin embedded platform.

Paper Structure (19 sections, 6 equations, 7 figures, 4 tables)

Figures (7)

  • Figure 1: Overview of the proposed NeRF-VINS, where $\{G\}$ is the global VIO frame, $\{N\}$ is the map frame, and $\{K\}$ denotes the NeRF-rendered image; $\{I\}$ and $\{C\}$ are the IMU and camera frames, respectively. Click on the image for a video demo.
  • Figure 2: Example rendered images for testing matching methods. Left: image rendered at 424$\times$240. Right: image rendered at 141$\times$80 and upscaled to 424$\times$240 with FSRCNN [Dong2016_ECCV].
  • Figure 3: Qualitative study of failure cases of the classical place recognition method. Green and red lines indicate inliers and outliers, respectively. In each column, the input image is on the left and the retrieved image is on the right; in the NeRF case, the retrieved image is rendered at 212$\times$140 and upsampled to 424$\times$240. Images are shown in color for visualization purposes.
  • Figure 4: Qualitative comparison of NeRF maps trained with different methods on 543 keyframe images. The top row shows the PSNR histograms and the bottom row shows exemplary images rendered by each method.
  • Figure 5: Exemplary environment configurations in [Chen2023ICRA].
  • ...and 2 more figures
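
The captions for Figures 2 and 3 quote low render resolutions that are upsampled to 424×240. A rough way to see why this helps is to count rays per image. The sketch below assumes the usual volume-rendering setup of one ray per output pixel, which this summary does not confirm for NeRF-VINS, and it ignores the cost of the FSRCNN upsampling step. The names `ray_count` and `ray_savings` are illustrative and not taken from the paper.

```python
# Resolutions quoted in the figure captions, as (width, height).
FULL_RES = (424, 240)
FIG2_LOW_RES = (141, 80)    # Figure 2, right panel
FIG3_LOW_RES = (212, 140)   # Figure 3, NeRF case


def ray_count(resolution):
    """Rays needed for one image, ASSUMING one ray per pixel."""
    width, height = resolution
    return width * height


def ray_savings(low_res, full_res=FULL_RES):
    """How many times fewer rays the low-resolution render needs."""
    return ray_count(full_res) / ray_count(low_res)


if __name__ == "__main__":
    for name, res in [("Figure 2", FIG2_LOW_RES), ("Figure 3", FIG3_LOW_RES)]:
        print(f"{name}: {res[0]}x{res[1]} needs {ray_count(res)} rays, "
              f"{ray_savings(res):.2f}x fewer than "
              f"{FULL_RES[0]}x{FULL_RES[1]}")
```

Under this assumption, the 141×80 render needs about nine times fewer rays than a direct 424×240 render, and the 212×140 render needs about 3.4 times fewer. The actual speed-up on the Jetson AGX Orin depends on the NeRF implementation and on the upsampling cost, neither of which is modeled here.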