IFFNeRF: Initialisation Free and Fast 6DoF pose estimation from a single image and a NeRF model

Matteo Bortolon, Theodore Tsesmelis, Stuart James, Fabio Poiesi, Alessio Del Bue

TL;DR

IFFNeRF is designed to run in real time and does not need an initial pose guess close to the sought solution. Compared to iNeRF, it can improve angular and translation accuracy by 80.1% and 67.3%, respectively, while running at 34 fps on consumer hardware.

Abstract

We introduce IFFNeRF to estimate the six degrees-of-freedom (6DoF) camera pose of a given image, building on the Neural Radiance Fields (NeRF) formulation. IFFNeRF is specifically designed to operate in real time and eliminates the need for an initial pose guess that is close to the sought solution. IFFNeRF uses the Metropolis-Hastings algorithm to sample surface points from within the NeRF model. From these sampled points, we cast rays and deduce the color for each ray through pixel-level view synthesis. The camera pose can then be estimated as the solution to a least-squares problem by selecting correspondences between the query image and the resulting bundle of rays. We facilitate this process through a learned attention mechanism that bridges the query image embedding with the embedding of parameterized rays, thereby matching rays pertinent to the image. Through synthetic and real evaluation settings, we show that our method can improve angular and translation accuracy by 80.1% and 67.3%, respectively, compared to iNeRF, while running at 34 fps on consumer hardware and without requiring an initial pose guess.
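The surface-point sampling step can be illustrated with a random-walk Metropolis-Hastings sampler. This is a minimal sketch under stated assumptions, not the authors' implementation: `toy_density` is a hypothetical stand-in for the density that a trained NeRF would return, and the Gaussian proposal, step size, and burn-in length are illustrative choices that the abstract does not specify.

```python
import math
import random


def toy_density(p, radius=1.0, sigma=0.05):
    """Hypothetical stand-in for a NeRF density query: peaks on a sphere surface."""
    r = math.sqrt(sum(x * x for x in p))
    return math.exp(-((r - radius) ** 2) / (2.0 * sigma ** 2))


def metropolis_hastings(density, start, n_samples, step=0.2, burn_in=500, rng=None):
    """Random-walk Metropolis-Hastings over 3D points, targeting `density`."""
    rng = rng or random.Random(0)
    current = list(start)
    current_density = density(current)
    samples = []
    for i in range(burn_in + n_samples):
        # Symmetric Gaussian proposal, so the acceptance ratio reduces to
        # the ratio of target densities.
        proposal = [x + rng.gauss(0.0, step) for x in current]
        proposal_density = density(proposal)
        accept = (current_density == 0.0
                  or rng.random() < proposal_density / current_density)
        if accept:
            current, current_density = proposal, proposal_density
        if i >= burn_in:
            samples.append(list(current))
    return samples


# Samples should concentrate near the unit sphere, the "surface" of the toy scene.
samples = metropolis_hastings(toy_density, start=[1.0, 0.0, 0.0], n_samples=2000)
radii = [math.sqrt(sum(x * x for x in p)) for p in samples]
mean_radius = sum(radii) / len(radii)
```

Because the chain only needs density ratios, the target does not have to be normalised, which is what makes this kind of sampler convenient for an implicit scene representation.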

Paper Structure (12 sections, 9 equations, 6 figures, 1 table)

Figures (6)

  • Figure 1: From (i) a given image with an unknown pose and (ii) a NeRF model, we recover the pose by first (iii) sampling surface points using the Metropolis-Hastings algorithm and (iv) casting rays from them in an isocell distribution. We then (v/vi) correlate rays with the image to identify relevant rays using attention and (vii) recover the unknown 6DoF camera pose.
  • Figure 2: Our IFFNeRF takes as input (i) an image and (ii) a NeRF model. The model $\psi_I$ encodes the image (iii) using a visual backbone. As for the NeRF model, we sample surface points (iv) using Metropolis-Hastings to identify candidates on the surface $\textbf{u}$, which act as locations from which to cast rays. We uniformly cast rays (v) from these points through the isocell centers, and estimate the corresponding ray parameters, color and normal, using the NeRF model. We then embed the ray representation with $\psi_r$ (vi) and (vii) learn attention $A$ between the image embedding $\textbf{I}_{fea}$ and the ray embedding $\textbf{R}_{fea}$ to rank rays in relation to the image. We select the top-N rays and estimate the camera location (viii) using least squares, resulting in a 6DoF pose $\hat{\textbf{P}}$ for the image.
  • Figure 3: Illustration of the Isocell ray generation method over a circular domain of a unit disk and unit sphere. The generated points indicate the ray positions within the equally spaced circle cells (we will denote them as "cell centres" for simplicity).
  • Figure 4: Example of generated rays using our approach. The zoomed region highlights the ray cast operation (in this case $27$ rays per isocell).
  • Figure 5: Example of the test-time pose estimation on the Chair object of Synthetic NeRF. (a) Top $N$ (red) vs. top $N$ ground-truth (green) rays. (b) Corresponding score distribution of top $N$ and ground-truth rays.
  • ...and 1 more figures
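
The least-squares step (viii) in Figure 2 can be sketched as finding the 3D point closest to a bundle of rays. This is a minimal, dependency-free illustration under stated assumptions rather than the paper's exact formulation: it recovers only the camera centre (not the rotation), it treats each selected ray as an infinite line, and the function names `closest_point_to_rays`, `det3`, and `solve3` are hypothetical.

```python
import math
import random


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def solve3(A, b):
    """Solve the 3x3 linear system A x = b with Cramer's rule."""
    det = det3(A)
    if abs(det) < 1e-12:
        raise ValueError("rays are (nearly) parallel; the point is not unique")
    x = []
    for col in range(3):
        m = [row[:] for row in A]
        for r in range(3):
            m[r][col] = b[r]
        x.append(det3(m) / det)
    return x


def closest_point_to_rays(origins, directions):
    """Point c minimising the summed squared distance to all rays.

    Normal equations: sum_i (I - d_i d_i^T) c = sum_i (I - d_i d_i^T) o_i,
    where o_i is a ray origin and d_i its unit direction.
    """
    A = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for o, d in zip(origins, directions):
        norm = math.sqrt(sum(x * x for x in d))
        d = [x / norm for x in d]
        for r in range(3):
            for c in range(3):
                # Entry (r, c) of the projector onto the plane orthogonal to d.
                p = (1.0 if r == c else 0.0) - d[r] * d[c]
                A[r][c] += p
                b[r] += p * o[c]
    return solve3(A, b)


# Synthetic check: rays from random surface points that all pass through one camera centre.
random.seed(0)
camera = [1.0, -2.0, 0.5]
surface_points = [[random.gauss(0.0, 1.0) for _ in range(3)] for _ in range(50)]
directions = [[c - p for c, p in zip(camera, pt)] for pt in surface_points]
estimate = closest_point_to_rays(surface_points, directions)
```

With noise-free rays the estimate coincides with the true centre up to floating-point error. With imperfectly ranked rays, the same normal equations give the point that best agrees with the selected bundle, which is why the quality of the top-N selection matters.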