Visible Structure Retrieval for Lightweight Image-Based Relocalisation
Fereidoon Zangeneh, Leonard Bruns, Amit Dekel, Alessandro Pieropan, Patric Jensfelt
TL;DR
This work addresses the scalability challenges of structure-based camera relocalisation in large scenes by replacing image retrieval and heuristic search with a learned, scene-specific regression of the visible 3D structure from a query image. Framed as a conditional variational autoencoder, the approach (ViStR) trains a small network to map image observations to a posterior over visible SfM points. The visible points are then retrieved efficiently via a radius search and used for robust 2D–3D matching with PnP-RANSAC. The method achieves accuracy on par with state-of-the-art structure-based baselines while reducing storage and computation, and it localises quickly (sub-100 ms) at a cost that scales with scene complexity rather than with the number of past observations. Overall, ViStR provides a practical, scalable alternative to image retrieval for large-scale relocalisation, with potential for descriptor-free extensions in the future.
Abstract
Accurate camera pose estimation from an image observation in a previously mapped environment is commonly done with structure-based methods, which find correspondences between 2D keypoints in the image and 3D structure points in the map. To make this correspondence search tractable in large scenes, existing pipelines either rely on search heuristics or perform image retrieval, reducing the search space by comparing the current image to a database of past observations. However, these approaches result in elaborate pipelines or in storage requirements that grow with the number of past observations. In this work, we propose a new paradigm for making structure-based relocalisation tractable. Instead of relying on image retrieval or search heuristics, we learn a direct mapping from image observations to the visible scene structure in a compact neural network. Given a query image, a forward pass through our novel visible structure retrieval network yields the subset of 3D structure points in the map that the image views, thus reducing the search space of 2D-3D correspondences. We show that our proposed method enables localisation with an accuracy comparable to the state of the art, while requiring a lower computational and storage footprint.
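As a rough illustration of how a radius search can shrink the 2D-3D search space, the sketch below keeps only those map points that lie within a fixed distance of some predicted 3D location. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `retrieve_visible_points`, the toy coordinates, the radius value, and the idea that the network output is available as a list of predicted 3D points are all hypothetical. A brute-force scan stands in for the spatial index (for example a k-d tree) that a practical system would likely use, and the subsequent descriptor matching and PnP-RANSAC steps are not shown.

```python
import math


def retrieve_visible_points(map_points, predicted_points, radius):
    """Return sorted indices of map points within `radius` of any predicted point.

    Hypothetical helper: brute-force radius search over all map points.
    """
    visible = set()
    for query in predicted_points:
        for idx, point in enumerate(map_points):
            # Euclidean distance between the predicted location and the map point.
            if math.dist(query, point) <= radius:
                visible.add(idx)
    return sorted(visible)


# Toy map of four 3D structure points (assumed values, for illustration only).
map_points = [
    (0.0, 0.0, 0.0),
    (1.0, 0.0, 0.0),
    (5.0, 5.0, 5.0),
    (1.1, 0.1, 0.0),
]

# Assumed network output: one predicted location of visible structure.
predicted_points = [(1.0, 0.05, 0.0)]

# Only the retrieved subset would be passed on to 2D-3D matching.
candidates = retrieve_visible_points(map_points, predicted_points, radius=0.2)
print(candidates)
```

In this toy case only the two map points near the predicted location are retained, so the subsequent correspondence search would consider two candidates instead of four.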
