Combining Absolute and Semi-Generalized Relative Poses for Visual Localization

Vojtech Panek, Torsten Sattler, Zuzana Kukelova

TL;DR

This work addresses visual localization under imperfect scene geometry by combining structure-based (2D-3D) and structure-less (2D-2D) pose estimation through adaptive pose selection. It introduces a RANSAC-based framework that jointly uses a P3P (2D-3D) solver and an E5+1 (2D-2D) solver, with MSAC-based scoring and optional local refinement to select the best pose per query. Across diverse datasets and scene representations, the adaptive approach matches or surpasses single-strategy baselines, with notable gains when the representation is sparse or the geometry is noisy. Its success hinges on the choice of scoring function and refinement strategy. The approach is practical and scalable, and the authors plan to release code to facilitate real-world adoption and further research.
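To illustrate the pose-selection idea, here is a minimal sketch of generic MSAC scoring (a truncated quadratic cost over residuals) used to choose between two candidate poses. It is not the authors' implementation: the function names, the residual values, the threshold, and the assumption that residuals from both solvers are expressed in comparable units (pixels) are all hypothetical.

```python
from typing import Dict, Sequence


def msac_score(residuals: Sequence[float], threshold: float) -> float:
    """Truncated quadratic (MSAC) cost: each residual contributes
    min(r^2, threshold^2), so outliers add a constant penalty.
    Lower is better."""
    t2 = threshold * threshold
    return sum(min(r * r, t2) for r in residuals)


def select_pose(candidates: Dict[str, Sequence[float]], threshold: float) -> str:
    """Return the name of the candidate pose with the lowest MSAC cost."""
    return min(candidates, key=lambda name: msac_score(candidates[name], threshold))


# Hypothetical residuals (assumed to be in pixels) for two candidate poses,
# one from a 2D-3D solver and one from a 2D-2D solver.
candidates = {
    "p3p": [0.5, 1.0, 8.0, 0.2],
    "e5+1": [0.4, 0.6, 1.5, 0.3],
}
best = select_pose(candidates, threshold=2.0)
```

In this toy input the third residual of the first candidate exceeds the threshold and is capped, yet the second candidate still has the lower total cost and is selected. How residuals from 2D-2D and 2D-3D matches are made comparable is exactly the kind of scoring choice the paper ablates, so the single shared threshold here is only a simplification.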

Abstract

Visual localization is the problem of estimating the camera pose of a given query image within a known scene. Most state-of-the-art localization approaches follow the structure-based paradigm and use 2D-3D matches between pixels in a query image and 3D points in the scene for pose estimation. These approaches assume an accurate 3D model of the scene, which might not always be available, especially if only a few images are available to compute the scene representation. In contrast, structure-less methods rely on 2D-2D matches and do not require any 3D scene model. However, they are also less accurate than structure-based methods. Although one prior work proposed to combine structure-based and structure-less pose estimation strategies, its practical relevance has not been shown. We analyze combining structure-based and structure-less strategies while exploring how to select between poses obtained from 2D-2D and 2D-3D matches, respectively. We show that combining both strategies improves localization performance in multiple practically relevant scenarios.

Paper Structure (9 sections, 6 equations, 10 figures, 7 tables)

Figures (10)

  • Figure 1: Ablating scoring functions on the Cambridge Landmarks dataset (Kendall2015PoseNetAC). We report the percentage of images localized within 10 cm and 1 degree of the ground truth. The scene is represented using SfM point clouds computed using every N-th database image.
  • Figure 2: Ablating scoring functions on the Cambridge Landmarks dataset (Kendall2015PoseNetAC). We report the percentage of images localized within 10 cm and 1 degree of the ground truth. The scene is represented using a NeRF trained using every N-th database image.
  • Figure 3: Ablating scoring functions on the 7 Scenes dataset (Glocker2013RealtimeRC, Shotton2013SceneCR). We report median position and orientation errors (lower is better).
  • Figure 4: Ablating scoring functions on the 7 Scenes dataset (Glocker2013RealtimeRC, Shotton2013SceneCR). We report recalls (higher is better) at three different thresholds.
  • Figure 5: Ablating local optimization approaches on the Cambridge Landmarks dataset (Kendall2015PoseNetAC). We report the percentage of images localized within 10 cm and 1 degree of the ground truth. The scene is represented using SfM point clouds computed using every N-th database image.
  • ...and 5 more figures
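
The metrics named in the captions above (percentage of images localized within a position and rotation threshold, and median errors) can be sketched as follows. This is an illustrative sketch rather than the authors' evaluation code: the function name, the error values, and the choice of inclusive thresholds are assumptions.

```python
from statistics import median
from typing import Sequence, Tuple


def recall_at(errors: Sequence[Tuple[float, float]],
              max_pos_m: float, max_rot_deg: float) -> float:
    """Percentage of queries whose position AND rotation errors are both
    within the given thresholds (inclusive bounds are an assumption)."""
    if not errors:
        return 0.0
    hits = sum(1 for pos, rot in errors
               if pos <= max_pos_m and rot <= max_rot_deg)
    return 100.0 * hits / len(errors)


# Hypothetical per-query errors: (position error in metres, rotation error in degrees).
errors = [(0.05, 0.4), (0.08, 1.5), (0.30, 0.9), (0.02, 0.1)]

# Recall at the 10 cm / 1 degree threshold used in the Cambridge Landmarks figures.
recall = recall_at(errors, max_pos_m=0.10, max_rot_deg=1.0)

# Median position and orientation errors, as reported for 7 Scenes.
median_pos = median(pos for pos, _ in errors)
median_rot = median(rot for _, rot in errors)
```

A query counts as localized only if both thresholds are met, which is why the second and third hypothetical queries are misses even though each satisfies one of the two bounds.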