Multiview Image-Based Localization
Cameron Fiore, Hongyi Fan, Benjamin Kimia
TL;DR
The paper addresses the accuracy gap of image retrieval (IR) methods for image-based localization with a hybrid IR-3D approach. It retrieves the top-$K$ anchor images with NetVLAD and decouples translation from orientation: the camera center is computed in closed form from relative translation estimates alone, and the orientation is then recovered from relative rotations. A second contribution introduces latent 3D points and multiview triangulation to refine the query pose directly from feature correspondences, removing the dependence on intermediate relative poses. On benchmarks including 7-Scenes, Cambridge Landmarks, Aachen Day-Night, and RobotCar, the method reports improved localization accuracy together with favorable timing and memory footprint compared to state-of-the-art baselines. Overall, the work offers a scalable, privacy-conscious localization pipeline for AR and autonomous systems with large image banks, combining anchor-based initialization with latent 3D refinement.
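As an illustration of how a camera center can be obtained in closed form from translation directions alone, the following is a minimal sketch and not necessarily the authors' formulation. It assumes each anchor camera center is known in world coordinates and that each relative translation estimate has already been expressed as a world-frame direction from the anchor toward the query camera (using the anchor's known orientation). The function name `camera_center_from_directions` and the sample data are hypothetical. The center is the point minimizing the summed squared distance to all rays, a standard least-squares ray intersection.

```python
import numpy as np

def camera_center_from_directions(anchor_centers, directions):
    """Least-squares intersection of the rays c_i + s * d_i.

    anchor_centers: (N, 3) anchor camera centers in world coordinates.
    directions:     (N, 3) world-frame directions from each anchor toward
                    the query camera (assumed inputs; any scale).
    """
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for c, d in zip(anchor_centers, directions):
        d = d / np.linalg.norm(d)
        # Projector onto the plane orthogonal to d: measures distance to the ray's line.
        P = np.eye(3) - np.outer(d, d)
        A += P
        b += P @ c
    # Normal equations; A is invertible when the directions are not all parallel.
    return np.linalg.solve(A, b)

# Hypothetical noise-free example: three anchors looking toward a known center.
true_center = np.array([1.0, 2.0, 3.0])
anchors = np.array([[0.0, 0.0, 0.0], [4.0, 0.0, 1.0], [0.0, 5.0, 2.0]])
dirs = true_center - anchors
estimate = camera_center_from_directions(anchors, dirs)
```

With noise-free directions the estimate coincides with the true center; with noisy directions it is the point closest to all rays in the least-squares sense. No relative rotation estimate enters the computation, which is the property the decoupling relies on.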
Abstract
The image retrieval (IR) approach to image localization has distinct advantages over the 3D and the deep learning (DNN) approaches: it is scene-agnostic, simpler to implement and use, has no privacy issues, and is computationally efficient. Its main drawback is relatively poor localization of the query camera, in both position and orientation, compared to the competing approaches. This paper presents a hybrid approach that stores only image features in the database, like some IR methods, but relies on a latent 3D reconstruction, like 3D methods, without retaining a 3D scene reconstruction. The approach is based on two ideas: {\em (i)} a novel proposal in which query camera center estimation relies only on relative translation estimates and not on relative rotation estimates, through a decoupling of the two, and {\em (ii)} a shift from computing the optimal pose from estimated relative poses to computing it directly from multiview correspondences, thus cutting out the ``middle-man''. Our approach shows improved performance on the 7-Scenes and Cambridge Landmarks datasets while also improving on timing and memory footprint compared to the state of the art.
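To make the notion of a latent 3D point computed from multiview correspondences concrete, here is a minimal sketch of linear (DLT) triangulation. It is a standard textbook technique shown under stated assumptions, not the paper's specific refinement procedure. The $3 \times 4$ projection matrices are assumed known for the anchor views, and the function name `triangulate_dlt` and the sample cameras are hypothetical.

```python
import numpy as np

def triangulate_dlt(projections, pixels):
    """Triangulate one 3D point from two or more views with the linear DLT method.

    projections: list of (3, 4) camera projection matrices (assumed known).
    pixels:      list of (u, v) observations of the same point, one per view.
    """
    rows = []
    for P, (u, v) in zip(projections, pixels):
        # Each view contributes two linear constraints on the homogeneous point X.
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(np.asarray(rows))
    X = vt[-1]
    return X[:3] / X[3]

def project(P, X):
    """Project a 3D point with projection matrix P and return (u, v)."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Hypothetical setup: three normalized cameras with identity rotation.
I = np.eye(3)
cams = [
    np.hstack([I, np.array([[0.0], [0.0], [0.0]])]),
    np.hstack([I, np.array([[-1.0], [0.0], [0.0]])]),
    np.hstack([I, np.array([[0.0], [-1.0], [0.0]])]),
]
point = np.array([0.5, 0.2, 4.0])
observations = [project(P, point) for P in cams]
recovered = triangulate_dlt(cams, observations)
```

In this noise-free setup the recovered point matches the original. With real feature matches the linear estimate would typically serve as an initialization for a nonlinear reprojection-error refinement.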
