Visible Structure Retrieval for Lightweight Image-Based Relocalisation

Fereidoon Zangeneh, Leonard Bruns, Amit Dekel, Alessandro Pieropan, Patric Jensfelt

TL;DR

This work addresses the scalability challenges of structure-based camera relocalisation in large scenes by replacing image retrieval and heuristic search with a learned, scene-specific regression of the visible 3D structure from a query image. Framed as a conditional variational autoencoder, the approach (ViStR) trains a small network to map image observations to a posterior over visible SfM points, enabling efficient retrieval via a radius search and robust 2D–3D matching with PnP-RANSAC. The method achieves accuracy on par with state-of-the-art structure-based baselines while reducing storage and computation, and delivers fast localisation (sub-100 ms) that scales with scene complexity rather than the number of past observations. Overall, ViStR provides a practical, scalable alternative to image retrieval for large-scale relocalisation, with potential for descriptor-free extensions in the future.

Abstract

Accurate camera pose estimation from an image observation in a previously mapped environment is commonly done through structure-based methods, that is, by finding correspondences between 2D keypoints in the image and 3D structure points in the map. To make this correspondence search tractable in large scenes, existing pipelines either rely on search heuristics or perform image retrieval, reducing the search space by comparing the current image to a database of past observations. However, these approaches result in elaborate pipelines or in storage requirements that grow with the number of past observations. In this work, we propose a new paradigm for making structure-based relocalisation tractable. Instead of relying on image retrieval or search heuristics, we learn a direct mapping from image observations to the visible scene structure in a compact neural network. Given a query image, a forward pass through our novel visible structure retrieval network yields the subset of 3D structure points in the map that the image views, thus reducing the search space of 2D-3D correspondences. We show that our proposed method enables localisation with an accuracy comparable to the state of the art, while requiring a lower computational and storage footprint.

Paper Structure

This paper contains 11 sections, 2 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Our proposed method, visible structure retrieval, retrieves the subset of SfM points that are visible per image observation. This lightweight setup serves to reduce the search space for establishing 2D-3D correspondences in a structure-based localisation pipeline, enabling fast and accurate localisation in large scenes.
  • Figure 2: (a) Our visible structure retrieval network is trained as the decoder of a VAE pipeline that, for an image observation $\boldsymbol{x}$, reconstructs its visible 3D structure points $\boldsymbol{y} \in \mathbb{R}^3$. Specifically, given image-level features of an observation $\boldsymbol{x}$, each structure point $\boldsymbol{y}$ from the SfM map that is visible in that image is encoded to its unique latent posterior $q(\boldsymbol{z} \mid \boldsymbol{x}, \boldsymbol{y})$, while the decoder is tasked with decoding latent samples $\boldsymbol{z} \in \mathbb{R}^d$ from this posterior back to the original point. At the same time, latent posteriors of different points visible in $\boldsymbol{x}$ are constrained to collectively conform to the prior $p(\boldsymbol{z}) = \mathcal{N}(\mathbf{0}, \mathbf{1})$. This training scheme organises the latent space so that it can be interpreted as the space of visible structure per image observation $\boldsymbol{x}$. (b) At inference time, the decoder maps noise samples from the prior distribution $\mathcal{N}(\mathbf{0}, \mathbf{1})$ to different regions of the scene structure visible in the observed image.
  • Figure 3: The novel visible structure retrieval network is at the heart of our localisation pipeline: given image-level features of a query $\boldsymbol{x}$, it predicts the posterior distribution over visible structure points $p(\boldsymbol{y} \mid \boldsymbol{x})$. This is done through a forward pass of an arbitrarily large set of noise samples together with the image-level features of $\boldsymbol{x}$ to get a set of regressed 3D points $\hat{\mathcal{Y}}$. A radius search in the SfM point cloud's k-d tree around elements of $\hat{\mathcal{Y}}$ then retrieves a submap $\tilde{\mathcal{Y}}$ from the full SfM map. This effectively confines the search space of 2D-3D matches in large scenes, such that nearest neighbour matching of descriptors for query keypoints and $\tilde{\mathcal{Y}}$ points yields sufficient inliers for PnP to recover the query camera pose.
  • Figure 4: Our approach effectively retrieves the set of 3D points observed in a query image.
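
The submap retrieval step described in Figure 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `retrieve_submap`, the toy point clusters, and the radius value are hypothetical, and a brute-force distance check stands in for the k-d tree radius search (at scale, a k-d tree query such as `scipy.spatial.cKDTree.query_ball_point` would be the natural replacement). The regressed points $\hat{\mathcal{Y}}$ are simulated with random samples rather than produced by a trained decoder.

```python
import numpy as np


def retrieve_submap(regressed_points, map_points, radius):
    """Return sorted indices of map points within `radius` of any regressed point.

    Brute-force stand-in for a k-d tree radius search (assumption for illustration).
    """
    # Pairwise differences, shape (num_regressed, num_map, 3).
    diff = regressed_points[:, None, :] - map_points[None, :, :]
    # Squared Euclidean distances, shape (num_regressed, num_map).
    sq_dist = np.einsum("ijk,ijk->ij", diff, diff)
    # A map point is retrieved if at least one regressed point lies within the radius.
    hit = (sq_dist <= radius ** 2).any(axis=0)
    return np.flatnonzero(hit)


# Toy "SfM map": two well-separated clusters of 3D points (hypothetical data).
rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=[0.0, 0.0, 0.0], scale=0.1, size=(50, 3))
cluster_b = rng.normal(loc=[10.0, 0.0, 0.0], scale=0.1, size=(50, 3))
map_points = np.vstack([cluster_a, cluster_b])

# Stand-in for the decoder output: regressed points scattered near cluster A.
regressed = rng.normal(loc=[0.0, 0.0, 0.0], scale=0.1, size=(8, 3))

# Indices of the retrieved submap; only these points would be passed on
# to descriptor matching and PnP-RANSAC.
submap_idx = retrieve_submap(regressed, map_points, radius=1.0)
print(f"retrieved {submap_idx.size} of {map_points.shape[0]} map points")
```

In this toy setting the retrieved indices cover only the cluster near the regressed points, which mirrors how the submap $\tilde{\mathcal{Y}}$ confines 2D-3D matching to the part of the scene the query image views. The radius trades recall against submap size and would need tuning per scene.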