A Scene is Worth a Thousand Features: Feed-Forward Camera Localization from a Collection of Image Features

Axel Barroso-Laguna; Tommaso Cavallari; Victor Adrian Prisacariu; Eric Brachmann

TL;DR

This work tackles the overhead of visual localization by proposing FastForward, a pipeline that localizes a query image against a sparse, multi-view map represented by $N$ 3D-anchored features sampled from $K$ posed mapping images. A ViT-based encoder with cross-attention predicts 3D coordinates for the query image in a normalized scene frame; metric scale is recovered via a scene scale factor $s$, and the camera pose is then estimated with PnP-RANSAC. A simple scene and scale normalization lets the model generalize across datasets with different scales, and a retrieval step selects the mapping images that form the map representation. Empirically, FastForward achieves state-of-the-art accuracy among methods with minimal map preparation time, and strong results against structure-based and scene coordinate regression (SCR) methods, on the outdoor Cambridge dataset as well as Wayspots, Indoor6, and RIO10, while drastically reducing map preparation time. The method generalizes robustly to unseen domains and keeps inference fast, making it attractive for real-time AR and navigation applications.
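To make the role of the scene scale factor $s$ concrete, the following is a minimal sketch of one possible normalization, not the paper's actual procedure. It assumes the scene is centered at the centroid of the 3D anchors and scaled so that their mean distance to the centroid is 1; the function names `normalize_scene` and `to_metric` are hypothetical. Coordinates predicted in the normalized frame would be mapped back to metric units with $X_{\text{metric}} = s \, X_{\text{norm}} + c$ before pose estimation.

```python
from math import sqrt


def normalize_scene(points):
    """Center 3D points and rescale them to unit mean distance from the centroid.

    Returns (normalized_points, centroid, s), where s plays the role of the
    scene scale factor. This specific normalization is an assumption made
    for illustration only.
    """
    n = len(points)
    centroid = tuple(sum(p[i] for p in points) / n for i in range(3))
    centered = [tuple(p[i] - centroid[i] for i in range(3)) for p in points]
    # Mean Euclidean distance of the anchors to their centroid.
    s = sum(sqrt(sum(c * c for c in q)) for q in centered) / n
    if s == 0.0:
        raise ValueError("degenerate scene: all points coincide")
    normalized = [tuple(c / s for c in q) for q in centered]
    return normalized, centroid, s


def to_metric(normalized_points, centroid, s):
    """Map coordinates from the normalized frame back to metric scene units."""
    return [
        tuple(s * q[i] + centroid[i] for i in range(3))
        for q in normalized_points
    ]


# Toy example: four anchors on a 4 m x 4 m square (hypothetical data).
anchors = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (0.0, 4.0, 0.0), (4.0, 4.0, 0.0)]
normalized, centroid, s = normalize_scene(anchors)
recovered = to_metric(normalized, centroid, s)
```

In this sketch the round trip `to_metric(normalize_scene(...))` reproduces the input anchors up to floating-point error, which is the property a scale-normalized predictor would rely on to report poses in metric units.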

Abstract

Visually localizing an image, i.e., estimating its camera pose, requires building a scene representation that serves as a visual map. The representation we choose has direct consequences for the practicality of our system. Even when starting from mapping images with known camera poses, state-of-the-art approaches still require hours of mapping time in the worst case, and several minutes in the best. This work asks whether we can achieve competitive accuracy much faster. We introduce FastForward, a method that creates a map representation and relocalizes a query image on-the-fly in a single feed-forward pass. At its core, we represent multiple mapping images as a collection of features anchored in 3D space. FastForward uses these mapping features to predict image-to-scene correspondences for the query image, enabling the estimation of its camera pose. We couple FastForward with image retrieval and achieve state-of-the-art accuracy when compared to other approaches with minimal map preparation time. Furthermore, FastForward demonstrates robust generalization to unseen domains, including challenging large-scale outdoor environments.
