A Guide to Structureless Visual Localization

Vojtech Panek, Qunjie Zhou, Yaqing Ding, Sérgio Agostinho, Zuzana Kukelova, Torsten Sattler, Laura Leal-Taixé

TL;DR

This paper surveys structureless visual localization methods, contrasting them with traditional structure-based approaches that rely on explicit 3D scene models. It provides a comprehensive review and an extensive experimental comparison across families such as pose triangulation, semi-generalized relative pose estimation, local SfM on the fly, and relative pose regression, using datasets like Aachen Day-Night, Extended CMU Seasons, and NAVER indoor scenes. The findings show that approaches with stronger geometric reasoning achieve higher pose accuracy, with local SfM on the fly delivering the best results, while semi-generalized relative pose estimation offers the best accuracy–runtime trade-off; regression-based methods remain behind geometry-based methods. Overall, structureless methods can be competitive with structure-based methods, offering flexibility and ease of scene updates, and the results point to promising directions for improving accuracy while maintaining efficiency.

Abstract

Visual localization algorithms, i.e., methods that estimate the camera pose of a query image in a known scene, are core components of many applications, including self-driving cars and augmented / mixed reality systems. State-of-the-art visual localization algorithms are structure-based, i.e., they store a 3D model of the scene and use 2D-3D correspondences between the query image and 3D points in the model for camera pose estimation. While such approaches are highly accurate, they are also rather inflexible when it comes to adjusting the underlying 3D model after changes in the scene. Structureless localization approaches represent the scene as a database of images with known poses and thus offer a much more flexible representation that can be easily updated by adding or removing images. Although there is a large amount of literature on structure-based approaches, there is significantly less work on structureless methods. Hence, this paper is dedicated to providing the, to the best of our knowledge, first comprehensive discussion and comparison of structureless methods. Extensive experiments show that approaches that use a higher degree of classical geometric reasoning generally achieve higher pose accuracy. In particular, approaches based on classical absolute or semi-generalized relative pose estimation outperform very recent methods based on pose regression by a wide margin. Compared with state-of-the-art structure-based approaches, the flexibility of structureless methods comes at the cost of (slightly) lower pose accuracy, indicating an interesting direction for future work.

Paper Structure

This paper contains 11 sections, 9 figures, 7 tables.

Figures (9)

  • Figure 1: Comparison of depth maps from different sources. The source image is shown on the top left. The corresponding source camera was used to render a mesh model (AC-14 model from MeshLoc [Panek2022ECCV]). The source image together with its focal length is the sole input to the Metric3D v2 [yin2023metric, Hu2024Metric3DVA] monocular depth estimator. As MASt3R [dust3r_cvpr24, MASt3R_eccv24] is a stereo model, it also uses a second image (shown in the bottom left) to predict the 3D geometry. MASt3R performs the prediction without any knowledge of the camera parameters. Both the Metric3D and MASt3R depth maps were aligned (in scale and shift) to the mesh depth map for easier comparison, whereas the experiments use them in their raw, unscaled form.
  • Figure 2: Localization results for the Ess. mat. (5Pt) approach for different features. We report localization recalls (higher is better) on the Y-axis at multiple pose thresholds (X-axis). For the outdoor scenes, the best results are obtained with the RoMa matcher. For the indoor scenes, the MASt3R matcher performs best for the coarser thresholds.
  • Figure 3: Localization results for the Ess. mat. (3Pt + depth) approach for different features and monocular depth predictors. We report localization recalls (higher is better) on the Y-axis at multiple pose thresholds (X-axis). For most scenes, the choice of the depth predictor is not critical. For outdoor scenes, RoMa yields the best results. For indoor scenes, MASt3R leads to the highest pose accuracy in most cases.
  • Figure 4: LazyLoc localization results for different features. We report localization recalls (higher is better) on the Y-axis at multiple pose thresholds (X-axis). There is no type of feature that performs best in all scenes. However, the MASt3R matcher performs well in general.
  • Figure 5: E5+1 localization results for different features. We report localization recalls (higher is better) on the Y-axis at multiple pose thresholds (X-axis). For the outdoor scenes, the best results are typically obtained with the RoMa matcher. For the indoor scenes, the MASt3R matcher performs best for the coarser thresholds.
  • ...and 4 more figures
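
The figure captions above report localization recall at multiple pose thresholds. As a minimal sketch of how such a metric is commonly computed (not taken from the paper), the Python snippet below counts a query as localized when both its position error and its rotation error fall within a threshold pair. The function names and the threshold values in the usage note are illustrative assumptions; this section does not state the thresholds the paper uses.

```python
import numpy as np


def rotation_error_deg(R_est, R_gt):
    """Angle in degrees of the relative rotation between two 3x3 rotation matrices."""
    cos_angle = (np.trace(R_est.T @ R_gt) - 1.0) / 2.0
    # Clip to guard against values slightly outside [-1, 1] due to rounding.
    return float(np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0))))


def position_error(c_est, c_gt):
    """Euclidean distance between estimated and reference camera centers."""
    return float(np.linalg.norm(np.asarray(c_est, dtype=float) - np.asarray(c_gt, dtype=float)))


def localization_recall(pos_errors, rot_errors, thresholds):
    """Percentage of queries within each (position, rotation) threshold pair.

    `thresholds` is a list of (max_position_error, max_rotation_error_deg)
    tuples; the values passed in are an assumption of this sketch.
    """
    pos_errors = np.asarray(pos_errors, dtype=float)
    rot_errors = np.asarray(rot_errors, dtype=float)
    recalls = []
    for max_pos, max_rot in thresholds:
        localized = (pos_errors <= max_pos) & (rot_errors <= max_rot)
        recalls.append(100.0 * float(np.mean(localized)))
    return recalls
```

For example, with hypothetical per-query errors `pos = [0.1, 0.4, 3.0, 10.0]` (meters) and `rot = [1.0, 4.0, 8.0, 1.0]` (degrees), calling `localization_recall(pos, rot, [(0.25, 2.0), (0.5, 5.0), (5.0, 10.0)])` yields one recall value per threshold pair, and the values never decrease as the thresholds loosen.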