NeRF-VIO: Map-Based Visual-Inertial Odometry with Initialization Leveraging Neural Radiance Fields

Yanyu Zhang, Dongming Wang, Jie Xu, Mengyuan Liu, Pengxiang Zhu, Wei Ren

TL;DR

The paper tackles drift and relocalization challenges in map-based visual-inertial odometry for AR by introducing NeRF-VIO, which uses a pose-initialization MLP to relocalize the first frame within a pre-trained NeRF map and a two-stage update in an MSCKF framework that fuses both captured and NeRF-rendered images. A left-invariant SE(3) geodesic loss and a left-invariant metric on $\mathfrak{se}(3)$ ensure robust initialization across frame changes, while grid-based SSIM mitigates environmental alterations. Experiments on a real AR table dataset show that NeRF-VIO achieves superior initialization accuracy and latency compared to iNeRF and outperforms MSCKF in VIO accuracy, even under significant scene changes. The approach demonstrates practical value for robust, real-time AR localization using a pre-built NeRF map and online NeRF rendering.
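The grid-based SSIM idea can be illustrated with a minimal sketch: split the captured and rendered images into a grid, score each cell with SSIM, and flag low-scoring cells as likely scene changes. This is not the paper's implementation. The grid size, the threshold, the use of cell-wide statistics instead of a sliding window, and the names `ssim_cell`, `grid_ssim`, and `changed_cells` are all assumptions made for illustration.

```python
import numpy as np


def ssim_cell(a, b, data_range=1.0):
    """SSIM of two equally sized grayscale patches, using patch-wide statistics."""
    c1 = (0.01 * data_range) ** 2  # standard SSIM stabilizing constants
    c2 = (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    num = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    den = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return num / den


def grid_ssim(captured, rendered, rows=4, cols=4):
    """Return a (rows, cols) array holding one SSIM score per grid cell."""
    if captured.shape != rendered.shape:
        raise ValueError("images must have the same shape")
    h, w = captured.shape
    row_edges = np.linspace(0, h, rows + 1, dtype=int)
    col_edges = np.linspace(0, w, cols + 1, dtype=int)
    scores = np.empty((rows, cols))
    for i in range(rows):
        for j in range(cols):
            rs = slice(row_edges[i], row_edges[i + 1])
            cs = slice(col_edges[j], col_edges[j + 1])
            scores[i, j] = ssim_cell(captured[rs, cs], rendered[rs, cs])
    return scores


def changed_cells(scores, threshold=0.5):
    """Boolean mask of cells whose SSIM falls below an assumed threshold."""
    return scores < threshold
```

A caller could, for example, ignore rendered-image features that fall inside flagged cells; whether the paper uses the scores this way is not stated here.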

Abstract

A prior map serves as a foundational reference for localization in context-aware applications such as augmented reality (AR). Providing valuable contextual information about the environment, the prior map is a vital tool for mitigating drift. In this paper, we propose a map-based visual-inertial localization algorithm (NeRF-VIO) with initialization using neural radiance fields (NeRF). Our algorithm utilizes a multilayer perceptron model and redefines the loss function as the geodesic distance on \(SE(3)\), ensuring the invariance of the initialization model under a frame change within \(\mathfrak{se}(3)\). The evaluation demonstrates that our model outperforms an existing NeRF-based initialization solution in both accuracy and efficiency. By integrating a two-stage update mechanism within a multi-state constraint Kalman filter (MSCKF) framework, the state of NeRF-VIO is constrained by both captured images from an onboard camera and rendered images from a pre-trained NeRF model. The proposed algorithm is validated using a real-world AR dataset; the results indicate that our two-stage update pipeline outperforms MSCKF across all data sequences.
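To make the notion of constraining one state with two measurement sources concrete, here is a toy sketch of two sequential linear Kalman updates applied to the same state. It is not the MSCKF pipeline from the paper: there is no sliding window, feature triangulation, or nullspace projection, and the measurement model `H`, the ordering (camera first, NeRF second), and the function names are assumptions chosen only for illustration.

```python
import numpy as np


def kalman_update(x, P, z, H, R):
    """One linear Kalman measurement update; returns the corrected (x, P)."""
    S = H @ P @ H.T + R                      # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)           # Kalman gain
    x_new = x + K @ (z - H @ x)
    I_KH = np.eye(len(x)) - K @ H
    P_new = I_KH @ P @ I_KH.T + K @ R @ K.T  # Joseph form keeps P symmetric
    return x_new, P_new


def two_stage_update(x, P, H, z_cam, R_cam, z_nerf, R_nerf):
    """Apply a camera-based update, then a NeRF-based update (assumed order)."""
    x, P = kalman_update(x, P, z_cam, H, R_cam)
    x, P = kalman_update(x, P, z_nerf, H, R_nerf)
    return x, P
```

Each stage shrinks the covariance, so the second measurement source tightens the estimate left by the first.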

Paper Structure

This paper contains 15 sections, 1 theorem, 30 equations, 8 figures, 3 tables.

Key Result

Lemma 1

Left-invariant: The metric defined in (eq:init3) is left-invariant.
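The metric that Lemma 1 refers to, (eq:init3), is not reproduced on this page. As an illustration only, the sketch below uses one well-known left-invariant distance on \(SE(3)\), a weighted combination of the relative rotation angle and the translation difference, and checks numerically that left-multiplying both poses by the same transform leaves the distance unchanged. The function names and the weights `w_rot` and `w_trans` are assumptions and may differ from the paper's definition.

```python
import numpy as np


def make_pose(axis, angle, t):
    """Build a 4x4 rigid transform from an axis-angle rotation and a translation."""
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)
    K = np.array([[0.0, -axis[2], axis[1]],
                  [axis[2], 0.0, -axis[0]],
                  [-axis[1], axis[0], 0.0]])
    R = np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)  # Rodrigues
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T


def rotation_angle(R):
    """Rotation angle of R in radians, in [0, pi]."""
    cos_theta = (np.trace(R) - 1.0) / 2.0
    return float(np.arccos(np.clip(cos_theta, -1.0, 1.0)))


def se3_distance(T1, T2, w_rot=1.0, w_trans=1.0):
    """sqrt(w_rot * angle(R1^T R2)^2 + w_trans * ||t2 - t1||^2)."""
    R1, t1 = T1[:3, :3], T1[:3, 3]
    R2, t2 = T2[:3, :3], T2[:3, 3]
    theta = rotation_angle(R1.T @ R2)
    return float(np.sqrt(w_rot * theta ** 2 + w_trans * np.sum((t2 - t1) ** 2)))
```

For any transform `G`, the relative rotation `(R_G R1)^T (R_G R2)` equals `R1^T R2`, and the translation difference is only rotated by `R_G`, so `se3_distance(G @ T1, G @ T2)` matches `se3_distance(T1, T2)`. A change of the world (map) frame therefore does not alter the value.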

Figures (8)

  • Figure 1: An overview of our NeRF-VIO framework. Starting from the initial captured image, the pre-trained initialization model (yellow) outputs the first pose of the camera frame. Using IMU integration between the timestamp of the initial IMU measurement and that of the first camera measurement, we propagate backward to recover the initial IMU state. During online operation, we leverage both the pre-trained NeRF model (green) and the onboard camera to establish spatial constraints, which are used to update the poses within the current sliding window. These updated poses then undergo further IMU propagation and serve as input to the NeRF model for rendering subsequent images.
  • Figure 2: Comparison of input and output during model inference. The Init model estimates the camera pose in the world frame of a prior map based on a captured image. Conversely, the NeRF model renders an image when provided with a specific camera pose.
  • Figure 3: IMU pose initialization. From the init model, the relative pose between the first camera frame and the prior map frame can be determined. With the camera-IMU calibration parameters and the timestamps, the transformation between the first camera frame and the first IMU frame can be found.
  • Figure 4: The three timelines denote data received from different sensors and the NeRF model. We define the closest camera frame {$CC$} as the frame closest in time to when the NeRF model begins rendering.
  • Figure 5: Testing results of the NeRF model. (a) Ground truth of the test image. (b) Rendered image at iteration 1000. (c) Rendered image at iteration 50000. (d) Rendered image at iteration 200000.
  • ...and 3 more figures
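The frame bookkeeping described in the captions of Figures 3 and 4 can be sketched as follows. This is a hedged illustration, not the paper's code: it assumes 4x4 homogeneous transforms where `T_a_b` maps points from frame `b` into frame `a`, and the names `T_map_cam`, `T_cam_imu`, and `T_imu0_imuk`, as well as both function names, are hypothetical.

```python
import numpy as np


def invert_pose(T):
    """Inverse of a 4x4 rigid transform without a general matrix inverse."""
    R, t = T[:3, :3], T[:3, 3]
    T_inv = np.eye(4)
    T_inv[:3, :3] = R.T
    T_inv[:3, 3] = -R.T @ t
    return T_inv


def initial_imu_pose(T_map_cam, T_cam_imu, T_imu0_imuk):
    """Pose of the first IMU frame in the prior-map frame.

    T_map_cam:   first camera pose in the map frame (e.g. from an init model)
    T_cam_imu:   camera-IMU extrinsic calibration
    T_imu0_imuk: IMU motion integrated from the first IMU timestamp to the
                 first camera timestamp
    """
    T_map_imuk = T_map_cam @ T_cam_imu             # IMU pose at the camera time
    return T_map_imuk @ invert_pose(T_imu0_imuk)   # step back to the first IMU time


def closest_camera_index(camera_stamps, t_render):
    """Index of the camera frame closest in time to the NeRF rendering time."""
    stamps = np.asarray(camera_stamps, dtype=float)
    return int(np.argmin(np.abs(stamps - t_render)))
```

For instance, with camera timestamps `[0.0, 0.1, 0.2, 0.3]` and a rendering time of `0.17`, `closest_camera_index` selects index 2.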

Theorems & Definitions (2)

  • Lemma 1
  • proof