Improved Scene Landmark Detection for Camera Localization

Tien Do, Sudipta N. Sinha

TL;DR

The paper tackles efficient, privacy-preserving camera localization by reexamining scene landmark detection (SLD). It identifies two bottlenecks in the original SLD, insufficient model capacity and noisy training labels, and introduces SLD$^*$, which addresses both: landmarks are partitioned into non-overlapping groups, a separate lightweight detector is trained for each group, training labels are improved by estimating landmark visibility from dense reconstructions, and a compact heatmap-based architecture is paired with weighted pose estimation. This combination yields accuracy on par with state-of-the-art structure-based methods on Indoor-6, while being significantly faster and more storage-efficient. The work also demonstrates that larger landmark budgets and carefully designed ensembles can substantially boost performance, and the code and models are publicly available. Overall, SLD$^*$ offers a scalable, efficient alternative for real-time, privacy-conscious localization in indoor environments.
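The partitioning idea can be illustrated with a minimal sketch. It assumes landmarks are identified by integer ids and assigned to groups at random; the summary only states that the groups do not overlap, so the assignment rule, the function names, and the detection tuple layout are all hypothetical.

```python
import random


def partition_landmarks(num_landmarks, num_groups, seed=0):
    """Split landmark ids 0..num_landmarks-1 into disjoint, near-equal groups.

    Random assignment is an assumption made for illustration only.
    """
    ids = list(range(num_landmarks))
    random.Random(seed).shuffle(ids)
    # Striding over the shuffled list gives group sizes that differ by at most 1.
    return [sorted(ids[g::num_groups]) for g in range(num_groups)]


def merge_detections(per_group_detections, groups):
    """Map each detector's local output index back to a global landmark id.

    per_group_detections[g] holds (local_index, x, y, score) tuples from the
    hypothetical detector trained on groups[g].
    """
    merged = {}
    for group, detections in zip(groups, per_group_detections):
        for local_index, x, y, score in detections:
            merged[group[local_index]] = (x, y, score)
    return merged


# Example: 1000 landmarks split across 8 detectors of 125 landmarks each.
groups = partition_landmarks(num_landmarks=1000, num_groups=8)
```

Because every landmark belongs to exactly one group, the merged dictionary never has conflicting entries, and the pooled 2D-3D correspondences can be passed to a single pose solver.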

Abstract

Camera localization methods based on retrieval, local feature matching, and 3D structure-based pose estimation are accurate but require high storage, are slow, and are not privacy-preserving. A method based on scene landmark detection (SLD) was recently proposed to address these limitations. It involves training a convolutional neural network (CNN) to detect a few predetermined, salient, scene-specific 3D points or landmarks and computing camera pose from the associated 2D-3D correspondences. Although SLD outperformed existing learning-based approaches, it was notably less accurate than 3D structure-based methods. In this paper, we show that the accuracy gap was due to insufficient model capacity and noisy labels during training. To mitigate the capacity issue, we propose to split the landmarks into subgroups and train a separate network for each subgroup. To generate better training labels, we propose using dense reconstructions to estimate visibility of scene landmarks. Finally, we present a compact architecture to improve memory efficiency. In terms of accuracy, our approach is on par with state-of-the-art structure-based methods on the INDOOR-6 dataset but runs significantly faster and uses less storage. Code and models can be found at https://github.com/microsoft/SceneLandmarkLocalization.
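One plausible way to estimate landmark visibility from a dense reconstruction is a depth test: project the landmark into a training image and compare its depth with the depth of the mesh rendered from the same camera. The sketch below assumes a pinhole camera, a pre-rendered depth map given as nested lists, and a relative tolerance `rel_tol`; these are illustrative assumptions, not the authors' stated procedure.

```python
def project(point_world, R, t, K):
    """Pinhole projection x = K (R X + t); returns (u, v, depth) or None."""
    Xc = [sum(R[i][j] * point_world[j] for j in range(3)) + t[i] for i in range(3)]
    if Xc[2] <= 0:
        return None  # point is behind the camera
    u = K[0][0] * Xc[0] / Xc[2] + K[0][2]
    v = K[1][1] * Xc[1] / Xc[2] + K[1][2]
    return u, v, Xc[2]


def is_visible(point_world, R, t, K, mesh_depth, rel_tol=0.05):
    """Depth test against a rendered mesh depth map (hypothetical helper).

    mesh_depth[row][col] is the mesh depth along the camera z-axis, with 0
    meaning that no surface was rendered at that pixel.
    """
    proj = project(point_world, R, t, K)
    if proj is None:
        return False
    u, v, z = proj
    col, row = int(round(u)), int(round(v))
    if not (0 <= row < len(mesh_depth) and 0 <= col < len(mesh_depth[0])):
        return False  # projects outside the image
    d = mesh_depth[row][col]
    if d <= 0:
        return False  # no mesh surface at this pixel
    # Visible if the landmark is not noticeably behind the rendered surface.
    return z <= d * (1.0 + rel_tol)
```

A landmark that passes this test in an image can be labeled as visible there even when structure from motion did not match it, which is how a dense model could supply additional training observations.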

Paper Structure (9 sections, 1 equation, 6 figures, 4 tables)

Figures (6)

  • Figure 1: Key elements of the scene landmark detection-based localization approach [Do2022]. The figure shows a single model (SLD) for brevity, but Do et al. [Do2022] also proposed predicting landmark bearings using an additional model (NBE). This is discussed in the text.
  • Figure 2: [Top] The original SLD architecture [Do2022]. [Bottom] An illustration of the proposed SLD$^*$ architecture (see text for details).
  • Figure 3: Better Visibility Estimation. [Left] Two images from scene1 in the Indoor-6 dataset taken at different times of day and a rendering of the dense 3D mesh reconstruction of the scene. [Right] On the top right, we show a single row of patches depicting a scene landmark (indicated by the green square) in different images where the landmark was found to be visible. The original method leveraged data association from structure from motion only. On the lower right, we show patches for the same landmark based on the proposed visibility estimation approach, which also uses the dense mesh reconstruction (see text for details). The high appearance diversity in the observed patches under varying illumination makes the trained landmark detector more robust.
  • Figure 4: The top view of the mesh and 3D SfM point cloud from scene1, shown with the overlaid scene landmarks (red points). The sets of 300 and 1000 landmarks are both computed by the existing selection method [Do2022]. The image on the right shows that a higher number of landmarks provides denser scene coverage. We show later that this leads to an improvement in camera pose accuracy.
  • Figure 5: Accuracy/speed tradeoff of SLD$^*$ and hloc. The plot shows how hloc's performance varies with the number of matched image pairs. (The number of pairs was set to 1, 2, 5, 10, 15 and 20 respectively, as denoted by the text labels.) hloc's best accuracy was 71.4% with 20 image pairs, for which the timing was 14.2 seconds/image. Similarly, seven SLD$^*$ configurations were evaluated. The text label a $\times$ b next to each blue dot indicates the SLD$^*$ configuration, where a is the number of landmarks in each partition and b is the number of partitions. SLD$^*$'s best result was 70.1% using 125 $\times$ 8 = 1000 landmarks with a running time of 0.3 seconds/image. The plot shows that, in terms of accuracy, SLD$^*$'s best configuration is competitive with hloc but more than 40$\times$ faster.
  • ...and 1 more figure
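The headline numbers in the Figure 5 caption can be checked with a few lines of arithmetic. The variable names below are illustrative; the values are the ones quoted in the caption.

```python
# Values quoted in the Figure 5 caption.
hloc_accuracy_pct = 71.4       # best hloc result, 20 matched image pairs
hloc_seconds_per_image = 14.2
sld_accuracy_pct = 70.1        # best SLD* result
sld_seconds_per_image = 0.3
landmarks_per_partition = 125
num_partitions = 8

total_landmarks = landmarks_per_partition * num_partitions
speedup = hloc_seconds_per_image / sld_seconds_per_image   # roughly 47x
accuracy_gap_pct = hloc_accuracy_pct - sld_accuracy_pct    # about 1.3 points

print(f"{total_landmarks} landmarks, {speedup:.1f}x faster, "
      f"{accuracy_gap_pct:.1f} points lower accuracy")
```

The ratio works out to roughly 47, which is consistent with the caption's claim of a more than 40-fold speedup at a cost of about 1.3 percentage points of accuracy.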