Multiview Image-Based Localization

Cameron Fiore, Hongyi Fan, Benjamin Kimia

TL;DR

The paper targets image-based localization by addressing the accuracy gap in image retrieval methods through a hybrid IR-3D approach. It retrieves the top-$K$ anchor images with NetVLAD and computes the camera pose by decoupling translation from orientation: a closed-form estimate of the camera center that depends only on relative translation estimates is followed by an orientation estimate from relative rotations. A second key contribution introduces latent 3D points and multiview triangulation to refine the query pose directly from feature correspondences, removing the dependency on intermediate relative poses. Across multiple benchmarks including 7-Scenes, Cambridge Landmarks, Aachen Day-Night, and RobotCar, the method achieves improved localization accuracy and greater efficiency, with favorable timing and memory footprints compared to state-of-the-art baselines. Overall, the work offers a scalable, privacy-conscious localization pipeline suitable for AR and autonomous systems with large image banks, leveraging both robust anchor-based initialization and latent 3D refinement.

Abstract

The image retrieval (IR) approach to image localization has distinct advantages over the 3D and the deep learning (DNN) approaches: it is scene-agnostic, simpler to implement and use, has no privacy issues, and is computationally efficient. The main drawback of this approach is relatively poor localization in both position and orientation of the query camera when compared to the competing approaches. This paper presents a hybrid approach that stores only image features in the database, like some IR methods, but relies on a latent 3D reconstruction, like 3D methods, without retaining a 3D scene reconstruction. The approach is based on two ideas: (i) a novel proposal where query camera center estimation relies only on relative translation estimates and not on relative rotation estimates, through a decoupling of the two, and (ii) a shift from computing optimal pose from estimated relative pose to computing optimal pose from multiview correspondences, thus cutting out the "middle-man". Our approach shows improved performance on the 7-Scenes and Cambridge Landmarks datasets while also improving on timing and memory footprint as compared to the state of the art.

Paper Structure

This paper contains 6 sections, 2 theorems, 15 equations, 5 figures, 4 tables.

Key Result

Proposition 1

The query camera center $\mathbf{c}_q$ which minimizes the sum of squared distances between $\mathbf{c}_q$ and the rays defined by $\mathbf{c}_k + \lambda_{qk} R_k^T \hat{\mathbf{T}}_{kq}, k=1,2,...,K$ is computed from
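The closed-form expression is not reproduced in this summary. As a hedged illustration only, the sketch below uses the standard least-squares solution for the point nearest a set of 3D lines, which may differ in form from the paper's expression. With unit directions $\mathbf{d}_k$ (here standing in for $R_k^T \hat{\mathbf{T}}_{kq}$) and projectors $P_k = I - \mathbf{d}_k \mathbf{d}_k^T$, the minimizer satisfies $\left(\sum_k P_k\right) \mathbf{c}_q = \sum_k P_k \mathbf{c}_k$. The function name `closest_point_to_lines` and its argument names are assumptions made for this example, and each constraint is treated as an infinite line ($\lambda_{qk}$ unconstrained).

```python
import numpy as np

def closest_point_to_lines(centers, directions):
    """Point minimizing the sum of squared distances to K lines in 3D.

    centers:    (K, 3) array, a point on each line (anchor camera centers c_k).
    directions: (K, 3) array, direction of each line (need not be unit length).
    Hypothetical helper for illustration; not taken from the paper's code.
    """
    centers = np.asarray(centers, dtype=float)
    directions = np.asarray(directions, dtype=float)
    # Normalize directions so that I - d d^T is an orthogonal projector.
    d = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    # P_k = I - d_k d_k^T removes the component along line k.
    P = np.eye(3)[None, :, :] - d[:, :, None] * d[:, None, :]
    A = P.sum(axis=0)                       # sum_k P_k
    b = np.einsum("kij,kj->i", P, centers)  # sum_k P_k c_k
    # A is singular when all directions are parallel; solve() then raises.
    return np.linalg.solve(A, b)

# Two skew lines: the x-axis, and a line through (0, 0, 2) along y.
c_q = closest_point_to_lines([[0, 0, 0], [0, 0, 2]], [[1, 0, 0], [0, 1, 0]])
```

For $K = 2$ this reduces to the midpoint of the shortest segment between the two lines, the construction described for two anchor images in the figure captions. Only the anchor centers and the line directions enter the computation, so the estimate does not involve the query orientation $R_q$.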

Figures (5)

  • Figure 1: The localization of the image query coordinates (red), namely, $(R_q,\mathbf{T}_q)$, with respect to the world coordinates (black) relies on the pre-localized coordinates of neighboring anchor images (blue), namely, $(R_k,\mathbf{T}_k)$, and the relative pose of the query with respect to its neighboring images, $(R_{qk},\mathbf{T}_{qk})$.
  • Figure 2: (a) The camera centers of $I_1$ and $I_2$, depicted as $\mathbf{c}_1$ and $\mathbf{c}_2$, and the corresponding rays extending outward from them. These rays should intersect at the query camera center but do not in practice. The optimal query camera center can then be taken as the midpoint of the shortest line $\mathbf{p}_1 \mathbf{p}_2$ between these rays, as done in triangulation. (b) When using $K$ cameras, each anchor image $I_k$ constrains the location of $\mathbf{c}_q$ to lie on the line from $\mathbf{c}_k$ to the epipole of $I_q$ on $I_k$, $e_{qk}$. The optimal $\mathbf{c}_q$ minimizes the distances to these loci. Note that this process is completely independent of the orientation of the query camera $R_q$.
  • Figure 3: The localization accuracy of our method is superior to that of Govindu's iterative method under varying degrees of simulated relative pose noise, with the disparity growing with the extent of noise.
  • Figure 4: Example result of the image retrieval step. Image (a) is the query image taken from the Old Hospital scene of Cambridge Landmarks, and image (b) is a $10 \times 15$ grid showing the top 150 most similar images, the vast majority of which are correct.
  • Figure 5: This study on the Great Court scene of Cambridge Landmarks shows that error drops significantly with increasing $K$ up to around 50 images, but continues to decrease with increasing $K$ for the range studied.

Theorems & Definitions (2)

  • Proposition 1
  • Proposition 2