Combining Absolute and Semi-Generalized Relative Poses for Visual Localization
Vojtech Panek, Torsten Sattler, Zuzana Kukelova
TL;DR
This work addresses visual localization under imperfect scene geometry by fusing structure-based (2D-3D) and structure-less (2D-2D) pose estimation through adaptive pose selection. It introduces a RANSAC-based framework that jointly uses a P3P (2D-3D) solver and an E5+1 (2D-2D) solver, with MSAC-based scoring and optional local refinement to select the best pose per query. Across diverse datasets and scene representations, the adaptive approach matches or surpasses single-strategy baselines, with notable gains when representations are sparse or geometry is noisy. The method’s success hinges on the choice of scoring function and refinement strategy. The approach is practical and scalable, and the authors plan to release code to facilitate real-world adoption and further research.
Abstract
Visual localization is the problem of estimating the camera pose of a given query image within a known scene. Most state-of-the-art localization approaches follow the structure-based paradigm and use 2D-3D matches between pixels in a query image and 3D points in the scene for pose estimation. These approaches assume an accurate 3D model of the scene, which might not always be available, especially if only a few images are available to compute the scene representation. In contrast, structure-less methods rely on 2D-2D matches and do not require any 3D scene model. However, they are also less accurate than structure-based methods. Although one prior work proposed to combine structure-based and structure-less pose estimation strategies, its practical relevance has not been shown. We analyze combining structure-based and structure-less strategies while exploring how to select between poses obtained from 2D-2D and 2D-3D matches, respectively. We show that combining both strategies improves localization performance in multiple practically relevant scenarios.
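To make the idea of selecting between a 2D-3D pose and a 2D-2D pose concrete, the following is a minimal sketch of MSAC-style scoring, not the authors' implementation. It assumes that both candidate poses have already been estimated and that each has been evaluated with a comparable per-match error in pixels. The function names, the candidate labels, the residual values, and the threshold are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Sequence


def msac_score(residuals: Sequence[float], threshold: float) -> float:
    """MSAC cost: inliers contribute r^2, outliers are capped at threshold^2."""
    t2 = threshold * threshold
    return sum(min(r * r, t2) for r in residuals)


def select_pose(candidates: Dict[str, Sequence[float]], threshold: float) -> str:
    """Return the label of the candidate pose with the lowest MSAC cost."""
    return min(candidates, key=lambda name: msac_score(candidates[name], threshold))


# Hypothetical per-match residuals (pixels) for two candidate poses.
# Assumption: both error lists are measured in a comparable way.
residuals = {
    "p3p_2d3d": [0.5, 1.0, 8.0, 12.0],   # two good matches, two outliers
    "e5p1_2d2d": [1.0, 1.5, 2.0, 2.5],   # all matches within the threshold
}

best = select_pose(residuals, threshold=3.0)
print(best)
```

Unlike plain inlier counting, the truncated cost also rewards how well the inliers fit, which is one reason the choice of scoring function can change which pose is selected.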
