PRAM: Place Recognition Anywhere Model for Efficient Visual Localization

Fei Xue, Ignas Budvytis, Roberto Cipolla

TL;DR

PRAM addresses efficient, accurate visual localization in large-scale environments by replacing global descriptors and exhaustive 2D-2D matching with a 3D landmark map and transformer-based sparse landmark recognition. It introduces self-supervised 3D landmark generation, a map built from 3D landmarks with virtual reference frames, and a sparse keypoint-based recognition module that predicts landmark labels, which in turn yield 2D-3D correspondences for 6-DoF pose estimation. The method achieves accuracy competitive with hierarchical methods while cutting memory cost by over 90% and running about 2.4 times faster at test time, a favorable accuracy-efficiency trade-off for edge devices. Experiments on indoor datasets (7Scenes, 12Scenes) and outdoor datasets (CambridgeLandmarks and the city-scale Aachen) support these claims, with applications in AR/VR, robotics, and autonomous driving.

Abstract

Visual localization is a key technique for a variety of applications, e.g., autonomous driving, AR/VR, and robotics. For these real-world applications, both efficiency and accuracy are important, especially on edge devices with limited computing resources. However, previous frameworks, e.g., absolute pose regression (APR), scene coordinate regression (SCR), and the hierarchical method (HM), are limited in either accuracy or efficiency in both indoor and outdoor environments. In this paper, we propose the place recognition anywhere model (PRAM), a new framework that performs visual localization efficiently and accurately by recognizing 3D landmarks. Specifically, PRAM first generates landmarks directly in 3D space in a self-supervised manner. Because they do not rely on commonly used classic semantic labels, these 3D landmarks can be defined anywhere in indoor and outdoor scenes, which gives them higher generalization ability. By representing the map with 3D landmarks, PRAM discards global descriptors, repetitive local descriptors, and redundant 3D points, significantly increasing memory efficiency. Then, sparse keypoints, rather than dense pixels, are used as the input tokens to a transformer-based recognition module for landmark recognition, which enables PRAM to recognize hundreds of landmarks with high time and memory efficiency. At test time, sparse keypoints and predicted landmark labels are used for outlier removal and landmark-wise 2D-3D matching, as opposed to exhaustive 2D-2D matching, which further increases time efficiency. A comprehensive evaluation of APRs, SCRs, HMs, and PRAM on both indoor and outdoor datasets demonstrates that PRAM outperforms APRs and SCRs in large-scale scenes by a large margin and gives accuracy competitive with HMs while reducing memory cost by over 90% and running 2.4 times faster, leading to a better balance between efficiency and accuracy.
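To make the test-time step in the abstract concrete, the following is a minimal sketch of landmark-wise 2D-3D matching with outlier removal. It is an illustration under stated assumptions, not the authors' implementation: the toy map, the `OUTLIER` label value, the `max_dist` threshold, and the function name `match_landmark_wise` are all hypothetical, and the 2-D descriptors stand in for real local descriptors.

```python
from math import dist

# Hypothetical toy map: landmark label -> list of (3D point, descriptor).
# 2-D descriptors are used only to keep the example readable.
LANDMARK_MAP = {
    1: [((0.0, 0.0, 5.0), (1.0, 0.0)), ((1.0, 0.0, 5.0), (0.0, 1.0))],
    2: [((9.0, 2.0, 7.0), (1.0, 0.1))],
}

OUTLIER = 0  # assumed label for keypoints that belong to no landmark


def match_landmark_wise(keypoints, descriptors, labels, landmark_map, max_dist=0.5):
    """Return (2D keypoint, 3D point) pairs, searching only inside the predicted landmark."""
    matches = []
    for kp, desc, label in zip(keypoints, descriptors, labels):
        if label == OUTLIER or label not in landmark_map:
            continue  # outlier removal: skip keypoints without a usable landmark
        # Nearest neighbour in descriptor space, restricted to a single landmark.
        point, best = min(
            ((p, dist(desc, d)) for p, d in landmark_map[label]),
            key=lambda item: item[1],
        )
        if best <= max_dist:  # hypothetical acceptance threshold
            matches.append((kp, point))
    return matches


# Three query keypoints with predicted labels; the second is flagged as an outlier.
keypoints = [(10.0, 20.0), (30.0, 40.0), (50.0, 60.0)]
descriptors = [(0.9, 0.1), (1.0, 0.0), (0.1, 0.9)]
labels = [1, OUTLIER, 1]
matches = match_landmark_wise(keypoints, descriptors, labels, LANDMARK_MAP)
```

Because each keypoint is compared only against the 3D points of its predicted landmark, the search space stays small regardless of how many landmarks the map holds. The resulting 2D-3D pairs would then go to a pose solver; which solver is used is not stated in this section.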

Paper Structure

This paper contains 18 sections, 6 equations, 10 figures, 8 tables, and 1 algorithm.

Figures (10)

  • Figure 1: Overview of the PRAM framework. PRAM first reconstructs the 3D map of a scene from reference images and then generates landmarks in 3D space in a self-supervised manner; the recognition module takes sparse keypoints [sfd22023] extracted from the query image as inputs and predicts the corresponding landmark labels with visual transformers; with the recognized landmarks, the registration module performs landmark-wise 2D-3D matching to recover the absolute pose of the query image.
  • Figure 2: Landmark generation on the Aachen dataset [aachen]. 3D points in the map are first projected onto the ground plane as 2D projections. Then we employ a hierarchical approach to cluster the 2D projections based on spatial connections. Compared with reference images, the landmark map represents the large-scale Aachen city [aachen] compactly with 512 landmarks (best viewed in color).
  • Figure 3: The structure of the map represented by 3D landmarks. A 3D map $\mathcal{M}$ is represented by a set of landmarks $\mathcal{L} = \{L_1,...,L_{\lambda_l}\}$ ($\lambda_l$ is the total number of landmarks). Each landmark $L_i$ contains several 3D points $\mathcal{P}_i=\{P_1,...,P_k\}$ and a virtual reference frame $V_i$. Each 3D point $P_j$ consists of its 3D coordinate $X\in \mathbb{R}^{3}$ and descriptor $\mathbf{d}\in \mathbb{R}^{128}$.
  • Figure 4: Adaptive landmark-wise 3D point pruning. Take landmark $L_1$ with 3D points $\{P_1,...,P_7\}$ and two reference images $F_1, F_2$ as an example. We first retain points $P_{1:5}$ because they are observed by $F_1$. As $P_5$ is already observed by $F_1$ as $p^{1}_5$, its observation $p^{2}_5$ on $F_2$ is removed. The 2D projection $p^{1}_6$ of $P_6$ on frame $F_1$ has a keypoint $p^{1}_5$ within a circular visible area of radius $\lambda_o$, so $P_6$ is removed from landmark $L_1$ and its 2D observation $p^{2}_6$ is also removed from $F_2$. The 2D projection $p^{1}_7$ of $P_7$ on frame $F_1$ has no candidates within a circle of radius $\lambda_o$, so $P_7$ and $p^{2}_7$ are retained.
  • Figure 5: The original map and the map represented by 3D landmarks. The left shows the 3D points and reference images of the map of Kings College in CambridgeLandmarks [posenet], reconstructed with Colmap [colmap2016]. The middle visualizes the generated 3D landmarks and the 3D points that remain after removing locally inconsistent ones. The right presents the virtual reference frames (VRFs) and the 3D points after adaptive landmark-wise 3D pruning. The numbers of images and 3D points are also included.
  • ...and 5 more figures
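
The pruning rule described in the Figure 4 caption can be sketched as follows. This is a simplified reading of the caption only, not the paper's algorithm: it assumes a single frame $F_1$, compares projections only against keypoints actually observed in $F_1$, and uses hypothetical pixel coordinates. The function name `prune_landmark_points` and its arguments are illustrative.

```python
from math import dist


def prune_landmark_points(observed_in_f1, projected_into_f1, radius):
    """Return the sorted ids of the 3D points kept for one landmark.

    observed_in_f1: point id -> 2D keypoint actually observed in frame F1.
    projected_into_f1: point id -> 2D projection into F1 of a point that is
        observed only in other frames.
    radius: pixel radius of the visible area (lambda_o in the caption).
    """
    kept = set(observed_in_f1)  # points observed by F1 are always retained
    for pid, proj in projected_into_f1.items():
        # A point is redundant if an observed keypoint already covers its projection.
        covered = any(dist(proj, kp) <= radius for kp in observed_in_f1.values())
        if not covered:
            kept.add(pid)  # nothing nearby in F1, so the point adds coverage
    return sorted(kept)


# Hypothetical pixel coordinates mirroring the caption's example:
# P1..P5 are observed in F1; P6 projects next to p5; P7 projects far from everything.
observed = {1: (10, 10), 2: (40, 12), 3: (70, 15), 4: (25, 50), 5: (60, 55)}
projected = {6: (62, 57), 7: (120, 90)}
kept_ids = prune_landmark_points(observed, projected, radius=5.0)
```

With these inputs, $P_6$ is dropped because its projection lies within the radius of $p^{1}_5$, while $P_7$ is kept, matching the outcome the caption describes. A larger `radius` prunes more aggressively and shrinks the map further, at the cost of fewer 3D points per landmark.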