A Scene is Worth a Thousand Features: Feed-Forward Camera Localization from a Collection of Image Features

Axel Barroso-Laguna, Tommaso Cavallari, Victor Adrian Prisacariu, Eric Brachmann

TL;DR

This work tackles the mapping overhead of visual localization by proposing FastForward, a pipeline that localizes a query image against a sparse, multi-view map represented by $N$ 3D-anchored features sampled from $K$ posed mapping images. A ViT-based encoder with cross-attention predicts 3D query coordinates in a normalized scene space; metric scale is then recovered via a scene scale factor $s$, and the pose is estimated with PnP-RANSAC. A simple scene and scale normalization approach generalizes across datasets with different scales, and a retrieval step selects the mapping images used to form the map representation. Empirically, FastForward delivers state-of-the-art accuracy among approaches with minimal map preparation time, and strong results against structure-based and scene coordinate regression (SCR) methods, on the outdoor Cambridge, Wayspots, Indoor6, and RIO10 datasets. The method demonstrates robust generalization to unseen domains and maintains fast inference, making it attractive for real-time AR and navigation applications.
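The scene normalization and scale recovery mentioned above can be illustrated with a small NumPy sketch. It follows the description in the Figure 2 caption (one mapping pose is moved to the origin and translations are rescaled so the largest one equals one), but several details are assumptions made for illustration: poses are taken to be 4x4 camera-to-world matrices, the first pose is the reference, "maximum translation in any direction" is read as the largest absolute translation coordinate, and the function names `normalize_poses` and `denormalize_points` are hypothetical. It is not the authors' implementation.

```python
import numpy as np


def normalize_poses(cam_to_world, ref_index=0, eps=1e-8):
    """Express poses relative to a reference camera and rescale translations.

    cam_to_world: (K, 4, 4) camera-to-world matrices (assumed convention).
    Returns the normalized poses and the scale factor playing the role of s.
    """
    poses = np.asarray(cam_to_world, dtype=np.float64)
    ref_inv = np.linalg.inv(poses[ref_index])
    # Left-multiplying by the inverse reference pose puts that camera at the origin.
    rel = ref_inv @ poses  # (4, 4) @ (K, 4, 4) broadcasts to (K, 4, 4)
    # Assumption: the scale is the largest absolute translation coordinate.
    scale = max(np.abs(rel[:, :3, 3]).max(), eps)
    rel[:, :3, 3] /= scale
    return rel, scale


def denormalize_points(points_norm, scale, ref_pose):
    """Map 3D points from the normalized space back to metric world space."""
    pts = np.asarray(points_norm, dtype=np.float64) * scale  # undo the scaling
    # Undo the re-referencing: rotate and translate by the reference pose.
    return pts @ ref_pose[:3, :3].T + ref_pose[:3, 3]


# Toy scene: three cameras with identity rotation and metric translations.
poses = np.stack([np.eye(4)] * 3)
poses[0, :3, 3] = [1.0, 0.0, 0.0]
poses[1, :3, 3] = [3.0, 0.0, 0.0]
poses[2, :3, 3] = [1.0, 4.0, 0.0]

normalized, s = normalize_poses(poses)
recovered = denormalize_points([[0.5, 0.0, 0.0]], s, poses[0])
print(s, recovered)
```

In this toy scene the largest relative translation is 4 units, so `s` is 4 and the normalized point (0.5, 0, 0) maps back to the metric world position (3, 0, 0), which is where the second camera sits. In a full pipeline the denormalized 3D points, paired with their 2D query pixel locations, would be passed to a PnP-RANSAC solver.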

Abstract

Visually localizing an image, i.e., estimating its camera pose, requires building a scene representation that serves as a visual map. The representation we choose has direct consequences for the practicality of our system. Even when starting from mapping images with known camera poses, state-of-the-art approaches still require hours of mapping time in the worst case, and several minutes in the best. This work asks whether we can achieve competitive accuracy much faster. We introduce FastForward, a method that creates a map representation and relocalizes a query image on-the-fly in a single feed-forward pass. At the core, we represent multiple mapping images as a collection of features anchored in 3D space. FastForward utilizes these mapping features to predict image-to-scene correspondences for the query image, enabling the estimation of its camera pose. We couple FastForward with image retrieval and achieve state-of-the-art accuracy when compared to other approaches with minimal map preparation time. Furthermore, FastForward demonstrates robust generalization to unseen domains, including challenging large-scale outdoor environments.

Paper Structure

This paper contains 23 sections, 2 equations, 6 figures, 9 tables.

Figures (6)

  • Figure 1: We introduce FastForward, a network that predicts query coordinates in a 3D scene space relative to a collection of mapping images with known poses. FastForward represents the scene as a random set of features sampled from mapping images, and returns the estimate for a query w.r.t. all mapping images in a single feed-forward pass. From left to right, we show how results improve when FastForward uses an increasing number of mapping images, as returned by image retrieval. Note that we always sample the same number of mapping features, and hence, FastForward's query runtime and GPU memory demand remain roughly constant in all three examples.
  • Figure 2: FastForward Architecture. FastForward uses a ViT encoder to compute features of the query, $I^Q$, and the mapping images. To create the map representation $M$, we randomly sample $N$ mapping features. Each mapping feature is augmented with a ray embedding that encodes its camera's position and viewing direction. Mapping poses are normalized by setting one pose to the origin and rescaling so that the maximum translation in any direction equals one. FastForward performs self- and cross-attention between the query features and the map representation. The query head predicts the 3D coordinates of the query features in the normalized space. The metric scale is recovered by applying the scene scale factor ($s$). The predicted 2D-3D correspondences yield the final query pose ($P_Q$). During training, a mapping head also predicts 3D coordinates for the mapping features, providing additional supervision.
  • Figure 3: Qualitative Examples. The estimated camera pose from FastForward is shown in blue, the ground-truth pose in green, and the mapping camera poses in gray. We visualize the predicted 3D coordinates of the query points, as well as the image patches from which the mapping features are sampled. We use 9 mapping images and a map representation with $N = 1{,}000$ features. FastForward effectively handles symmetries and non-discriminative patterns in the scenes. Moreover, since FastForward is agnostic to the scale of the scene, it can accurately predict poses in scenes with arbitrary scales, as demonstrated in the MegaDepth (Li and Snavely, 2018) example (bottom-left).
  • Figure 4: Accuracy vs Number of Mapping Images. We show how the accuracy under the 10cm, 10° and 10cm, 1° thresholds changes as we increase the number of mapping images used to create the map representation. We fix the size of the map representation to 768 mapping features.
  • Figure 5: Accuracy vs Number of Mapping Features. We fix the number of mapping images to 20 and show how the accuracies change as we increase the number of mapping features used to create the map representation of the scene.
  • ...and 1 more figure
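The captions for Figures 1, 2, 4, and 5 describe a map built from a fixed number of randomly sampled mapping features, independent of how many mapping images retrieval returns. A minimal sketch of that fixed-budget sampling is given below. It assumes uniform sampling without replacement over the pooled patch tokens of all mapping images, which the captions do not specify; the function name `sample_map_features` and the toy token shapes are hypothetical.

```python
import numpy as np


def sample_map_features(features_per_image, num_samples, seed=None):
    """Draw a fixed-size random subset of patch features pooled over images.

    features_per_image: list of (T_k, D) arrays, one per mapping image.
    Returns (features, image_ids, token_ids); the ids record which image and
    which patch each sampled feature came from, so that camera-dependent
    information (such as a ray embedding) can be attached afterwards.
    """
    rng = np.random.default_rng(seed)
    feats = np.concatenate(features_per_image, axis=0)
    image_ids = np.concatenate(
        [np.full(len(f), k) for k, f in enumerate(features_per_image)]
    )
    token_ids = np.concatenate([np.arange(len(f)) for f in features_per_image])
    # Assumption: uniform sampling without replacement over all pooled tokens.
    n = min(num_samples, len(feats))
    pick = rng.choice(len(feats), size=n, replace=False)
    return feats[pick], image_ids[pick], token_ids[pick]


# Toy check: the map size is the same for 3 and for 9 mapping images.
rng = np.random.default_rng(0)
few = [rng.normal(size=(196, 8)) for _ in range(3)]
many = [rng.normal(size=(196, 8)) for _ in range(9)]
map_few, ids_few, tok_few = sample_map_features(few, num_samples=100, seed=1)
map_many, ids_many, tok_many = sample_map_features(many, num_samples=100, seed=1)
print(map_few.shape, map_many.shape)
```

Because the number of sampled features is fixed, the input to the attention layers has the same size in both cases, which is consistent with the Figure 1 remark that query runtime and GPU memory stay roughly constant as more mapping images are used.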