Improved Scene Landmark Detection for Camera Localization
Tien Do, Sudipta N. Sinha
TL;DR
The paper tackles efficient, privacy-preserving camera localization by reexamining scene landmark detection (SLD). It identifies two bottlenecks in the original SLD, insufficient model capacity and noisy training labels, and introduces SLD$^*$ to address both: the landmarks are partitioned into non-overlapping groups with a separate lightweight detector trained for each group, training labels are improved by estimating landmark visibility from dense reconstructions, and a compact heatmap-based architecture is combined with weighted pose estimation. This combination yields accuracy on par with state-of-the-art structure-based methods on INDOOR-6 while being significantly faster and more storage-efficient. The work also shows that larger landmark budgets and carefully designed ensembles can substantially boost performance, and the authors release code and models to support adoption. Overall, SLD$^*$ offers a scalable, efficient alternative for real-time, privacy-conscious localization in indoor environments.
Abstract
Camera localization methods based on retrieval, local feature matching, and 3D structure-based pose estimation are accurate but require high storage, are slow, and are not privacy-preserving. A method based on scene landmark detection (SLD) was recently proposed to address these limitations. It involves training a convolutional neural network (CNN) to detect a few predetermined, salient, scene-specific 3D points or landmarks and computing camera pose from the associated 2D-3D correspondences. Although SLD outperformed existing learning-based approaches, it was notably less accurate than 3D structure-based methods. In this paper, we show that the accuracy gap was due to insufficient model capacity and noisy labels during training. To mitigate the capacity issue, we propose to split the landmarks into subgroups and train a separate network for each subgroup. To generate better training labels, we propose using dense reconstructions to estimate visibility of scene landmarks. Finally, we present a compact architecture to improve memory efficiency. In terms of accuracy, our approach is on par with state-of-the-art structure-based methods on the INDOOR-6 dataset, but it runs significantly faster and uses less storage. Code and models can be found at https://github.com/microsoft/SceneLandmarkLocalization.
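The subgroup idea described above can be illustrated with a minimal sketch. It is not taken from the released code: the round-robin split, the `min_score` threshold, and the function names `partition_landmarks` and `merge_detections` are all assumptions made for illustration. The sketch splits landmark ids into disjoint groups (one per network) and then pools the per-network detections into a single list of scored 2D-3D correspondences, which a pose solver could consume.

```python
from typing import List, Sequence, Tuple

Point2D = Tuple[float, float]
Point3D = Tuple[float, float, float]
Detection = Tuple[float, float, float]  # (x, y, score); hypothetical layout


def partition_landmarks(num_landmarks: int, num_groups: int) -> List[List[int]]:
    """Split landmark ids 0..num_landmarks-1 into disjoint groups.

    A round-robin split is assumed here; the paper may choose groups differently.
    """
    if num_groups <= 0:
        raise ValueError("num_groups must be positive")
    groups: List[List[int]] = [[] for _ in range(num_groups)]
    for landmark_id in range(num_landmarks):
        groups[landmark_id % num_groups].append(landmark_id)
    return groups


def merge_detections(
    groups: Sequence[Sequence[int]],
    per_group_detections: Sequence[Sequence[Detection]],
    landmarks_3d: Sequence[Point3D],
    min_score: float = 0.5,  # assumed confidence threshold
) -> List[Tuple[Point2D, Point3D, float]]:
    """Pool detections from all networks into scored 2D-3D correspondences.

    per_group_detections[g][j] is the detection for the j-th landmark of group g.
    """
    correspondences = []
    for group, detections in zip(groups, per_group_detections):
        for landmark_id, (x, y, score) in zip(group, detections):
            if score >= min_score:  # drop low-confidence detections
                correspondences.append(((x, y), landmarks_3d[landmark_id], score))
    return correspondences


# Toy example: 4 landmarks, 2 networks, made-up detections.
landmarks_3d = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 1.0, 2.0), (1.0, 1.0, 2.0)]
groups = partition_landmarks(len(landmarks_3d), 2)
detections = [
    [(10.0, 20.0, 0.9), (30.0, 40.0, 0.2)],  # network 0 -> landmarks 0 and 2
    [(50.0, 60.0, 0.7), (70.0, 80.0, 0.6)],  # network 1 -> landmarks 1 and 3
]
matches = merge_detections(groups, detections, landmarks_3d)
```

The retained scores could serve as per-correspondence weights in a subsequent pose estimation step; how the weighting is actually done is not specified in this section.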
