
MultiLoc: Multi-view Guided Relative Pose Regression for Fast and Robust Visual Re-Localization

Nobel Dang, Bing Li

Abstract

Relative Pose Regression (RPR) generalizes well to unseen environments, but its performance is often limited by its reliance on pairwise inputs and locally restricted spatial views. To this end, we propose MultiLoc, a novel multi-view guided RPR model trained at scale, equipping relative pose regression with globally consistent spatial and geometric understanding. Specifically, our method jointly fuses multiple reference views and their associated camera poses in a single forward pass, enabling accurate zero-shot pose estimation with real-time efficiency. To reliably supply informative context, we further propose a co-visibility-driven retrieval strategy for selecting geometrically relevant reference views. MultiLoc establishes a new benchmark in visual re-localization, consistently outperforming existing state-of-the-art (SOTA) RPR methods across diverse datasets, including WaySpots, Cambridge Landmarks, and Indoor6. Furthermore, MultiLoc's pose regressor exhibits SOTA performance in relative pose estimation, surpassing RPR, feature matching, and non-regression-based techniques on the MegaDepth-1500, ScanNet-1500, and ACID benchmarks. These results demonstrate robust domain generalization of MultiLoc across indoor, outdoor, and natural environments. Code will be made publicly available.
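The last stage the abstract describes, turning a predicted query-to-reference relative pose into a global query pose given the reference's known pose, reduces to a rigid-transform composition. The sketch below assumes a world-to-camera pose convention and a scalar translation scale (standing in for the paper's scale-recovery step); the function name and signature are illustrative, not the paper's actual implementation:

```python
import numpy as np

def compose_global_pose(R_ref, t_ref, R_rel, t_rel, scale=1.0):
    """Compose a global query pose from a reference pose and a relative pose.

    Convention (assumed): a pose (R, t) maps world points to camera frame,
    x_cam = R @ x_world + t. (R_ref, t_ref) is the reference camera's global
    pose; (R_rel, t_rel) maps the reference frame to the query frame.
    `scale` stands in for the metric scale recovered for the (usually
    scale-ambiguous) regressed translation.
    """
    R_q = R_rel @ R_ref                               # rotations compose directly
    t_q = R_rel @ np.asarray(t_ref) + scale * np.asarray(t_rel)
    return R_q, t_q

# Identity relative pose with a pure forward offset: the query pose is the
# reference pose shifted by the (scaled) relative translation.
R_ref, t_ref = np.eye(3), np.array([1.0, 0.0, 0.0])
R_q, t_q = compose_global_pose(R_ref, t_ref, np.eye(3), np.array([0.0, 0.0, 2.0]))
# t_q == [1, 0, 2], R_q == identity
```

With several reference views, each reference yields one such query-pose hypothesis; a robust average (or the paper's geometric optimization) can then fuse them.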

Paper Structure

This paper contains 20 sections, 3 equations, 8 figures, and 8 tables.

Figures (8)

  • Figure 1: Illustration of improving pose regression through scene context grounding. Estimating camera pose from image pairs can be erroneous, especially when there is little-to-no co-visibility between frames. To solve this, our proposed method (right) moves beyond the standard visual-only approach (left). By incorporating multi-view images and existing pose data, we create a sub-scene context that grounds the query image. This additional spatial information provides the necessary cues to stabilize pose estimation even when direct co-visibility is low.
  • Figure 2: MultiLoc architecture overview. MultiLoc is a visual localization method in which query poses are estimated using co-visible retrieved images with known camera poses. The retrieved images' extrinsics are embedded via an MLP into camera tokens $y$ and added to learnable camera tokens $l$, while all images pass through DINO-v2 [oquab2023dinov2] to obtain patch tokens. All the tokens are then concatenated and processed by alternating-attention (AA) transformer blocks. Finally, the camera tokens are used to predict relative poses between the query and reference images, followed by geometric optimization for scale recovery and obtaining the global pose of the query image. The red arrows denote the flow of camera tokens.
  • Figure 3: MultiLoc vs. ReLoc3r trajectory error on the GreatCourt sequence of the Cambridge Landmarks dataset. Gray dashed lines indicate the translational error between ground-truth and predicted poses. MultiLoc achieves significantly lower error by leveraging 3D sub-scene context from multiple images and their poses, whereas ReLoc3r's reliance on pairwise images leads to higher drift.
  • Figure 4: Visual re-localization result on the Indoor6 benchmark. The top row shows the query image and its corresponding retrieved reference views. The bottom row shows the 3D reconstructed scene from the Indoor6 dataset and the estimated camera pose for the query image. MultiLoc achieves high re-localization accuracy for both the translation and rotation components in the indoor domain, as demonstrated by the closeness of MultiLoc's predicted camera pose (green) to the ground-truth query pose frustum (orange). Blue frustums represent the supporting reference views.
  • Figure 5: Comparison of reference-view selection strategies. Compared to VPR-based retrieval (top), our co-visibility-based reference set (bottom) exhibits higher 3D surface overlap, or co-visibility, with the query. This geometric consistency and spatial relevance mitigate pose estimation errors, as evidenced by the shorter purple dashed line for positional error and the closer alignment between the ground-truth and predicted viewing-direction rays for rotation. In contrast, VPR-selected views suffer from poor 3D surface overlap, leading to higher angular and positional errors. Orange denotes the ground-truth query pose frustum, green the prediction, and blue the reference views.
  • ...and 3 more figures
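The co-visibility-driven retrieval contrasted with VPR in Figure 5 can be sketched as ranking reference views by their 3D-point overlap with the query. The sketch below assumes each image's set of observed 3D point IDs is available (e.g. from an SfM model, with the query's observations coming from a coarse initial match); the Jaccard overlap metric, the function name, and all identifiers are hypothetical, not the paper's exact procedure:

```python
def covisibility_rank(query_pts, ref_pts_by_id, k=3):
    """Rank reference views by 3D-point co-visibility with the query.

    query_pts: set of 3D point IDs observed in the query image.
    ref_pts_by_id: dict mapping reference image ID -> set of observed point IDs.
    Returns the top-k reference IDs by Jaccard overlap, descending.
    """
    scores = {}
    for ref_id, pts in ref_pts_by_id.items():
        inter = len(query_pts & pts)
        union = len(query_pts | pts)
        scores[ref_id] = inter / union if union else 0.0
    return sorted(scores, key=scores.get, reverse=True)[:k]

refs = {"A": {1, 2, 3, 4}, "B": {3, 4, 5}, "C": {9}}
covisibility_rank({1, 2, 3}, refs, k=2)  # -> ["A", "B"]
```

Unlike appearance-based VPR ranking, this score directly measures shared 3D structure, which is what the relative-pose regressor actually needs from its references.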