Matching 2D Images in 3D: Metric Relative Pose from Metric Correspondences

Axel Barroso-Laguna, Sowmya Munukutla, Victor Adrian Prisacariu, Eric Brachmann

TL;DR

This work tackles metric relative pose estimation from image pairs in the map-free relocalisation setting, where scale must be recovered without depth measurements. It introduces MicKey, a neural network that regresses 3D keypoint coordinates in camera space from a single image and matches them across views through a probabilistic, differentiable pipeline that includes differentiable RANSAC and a Kabsch solver. Training relies solely on image pairs and their relative poses. MicKey achieves state-of-the-art performance on the Map-Free Relocalisation benchmark and strong results on ScanNet, and it learns depth implicitly where it matters for matching. The resulting scale-aware poses target AR across diverse scenes, including image pairs with little visual overlap, where MicKey reasons about object shape; training combines soft inlier counting with curriculum learning.

Abstract

Given two images, we can estimate the relative camera pose between them by establishing image-to-image correspondences. Usually, correspondences are 2D-to-2D and the pose we estimate is defined only up to scale. Some applications, aiming at instant augmented reality anywhere, require scale-metric pose estimates, and hence, they rely on external depth estimators to recover the scale. We present MicKey, a keypoint matching pipeline that is able to predict metric correspondences in 3D camera space. By learning to match 3D coordinates across images, we are able to infer the metric relative pose without depth measurements. Depth measurements are also not required for training, nor are scene reconstructions or image overlap information. MicKey is supervised only by pairs of images and their relative poses. MicKey achieves state-of-the-art performance on the Map-Free Relocalisation benchmark while requiring less supervision than competing approaches.
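
To make concrete how matched 3D coordinates yield a metric pose, the following is a minimal NumPy sketch of a Kabsch-style solver. It is an illustration under stated assumptions, not the authors' implementation: the function name `kabsch`, the synthetic points, and the chosen rotation and translation are all hypothetical, and the correspondences are noise-free and outlier-free (in the pipeline described here, a RANSAC loop handles outliers).

```python
import numpy as np

def kabsch(src, dst):
    """Least-squares rigid transform (R, t) such that R @ src_i + t ~ dst_i.

    src, dst: (N, 3) arrays of matched 3D points (N >= 3, not all collinear).
    Hypothetical helper for illustration only.
    """
    src_c = src.mean(axis=0)
    dst_c = dst.mean(axis=0)
    # 3x3 cross-covariance of the centred point sets.
    H = (src - src_c).T @ (dst - dst_c)
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a rotation, not a reflection.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_c - R @ src_c
    return R, t

# Synthetic check: points in camera 1, and the same points seen from camera 2.
rng = np.random.default_rng(0)
pts_cam1 = rng.uniform(-1.0, 1.0, size=(20, 3))

angle = np.deg2rad(30.0)
R_true = np.array([
    [np.cos(angle), -np.sin(angle), 0.0],
    [np.sin(angle),  np.cos(angle), 0.0],
    [0.0,            0.0,           1.0],
])
t_true = np.array([0.5, -0.2, 1.5])  # assumed to be in metres
pts_cam2 = pts_cam1 @ R_true.T + t_true

R_est, t_est = kabsch(pts_cam1, pts_cam2)
print("rotation error:", np.abs(R_est - R_true).max())
print("translation error:", np.abs(t_est - t_true).max())
```

Because both point sets are expressed in metric camera coordinates, the recovered translation carries metric scale directly, which is the property that 2D-to-2D correspondences alone cannot provide.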

Paper Structure (19 sections, 12 equations, 9 figures, 11 tables)


Figures (9)

  • Figure 1: We introduce MicKey, a neural network that predicts 3D metric keypoint coordinates in camera space from a 2D input image. Given two images, MicKey establishes 3D-3D correspondences via descriptor matching and then applies a Kabsch solver (Kabsch, 1976) to recover the metric relative pose. We visualize the 3D keypoint coordinates by mapping them to the RGB cube.
  • Figure 2: Training pipeline. MicKey predicts 3D coordinates of keypoints in camera space. The network also predicts keypoint selection probabilities (keypoint distribution) and descriptors that steer the probabilities of matches (matching distribution). The combination of both distributions yields the probability of two keypoints being a correspondence in $P_{I \leftrightarrow I'}$, and we optimize the network such that correct correspondences are more likely. Within a differentiable RANSAC loop, we generate multiple relative pose hypotheses and compute their loss w.r.t. the ground truth transformation, $\hat{h}$. We generate gradients to train the correspondence probabilities $P_{I \leftrightarrow I'}$ via REINFORCE. Since our pose solver and loss function are differentiable, backpropagation also provides a direct signal to train the 3D keypoint coordinates.
  • Figure 3: MicKey Architecture. MicKey uses a feature extractor that splits the image into patches. For every patch, MicKey computes a 2D offset, a keypoint confidence, a depth value, and a descriptor vector. The 3D keypoint coordinates are obtained from the absolute position of the patch, its 2D offset, and its depth value.
  • Figure 4: Example of correspondences, scores and depth maps generated by MicKey. MicKey finds valid correspondences even under large changes in scale or wide baselines. Note that the depth maps have a resolution 14 times smaller than the input images due to our feature encoder. We follow the visualization of depth maps used in DPT (Ranftl et al., 2021), where brighter means closer.
  • Figure 5: MicKey establishes keypoint correspondences even though images share very little visual overlap. Contrary to other matchers, MicKey does not focus on the textured wall, but instead reasons about the shape of the object in the foreground.
  • ...and 4 more figures
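
The Figure 3 caption states that each 3D keypoint is obtained from the patch position, a 2D offset, and a depth value. A minimal sketch of that back-projection follows. Several details are assumptions not given in the text above: a pinhole camera with intrinsics `K`, offsets expressed as fractions of a patch measured from its top-left corner, depth measured along the optical axis, and the hypothetical helper name `backproject_patches`. The patch size of 14 pixels is taken from the Figure 4 caption (depth maps 14 times smaller than the input).

```python
import numpy as np

PATCH = 14  # pixels per patch side, per the Figure 4 caption

def backproject_patches(offsets, depths, K, patch=PATCH):
    """Lift one keypoint per patch to 3D camera space.

    offsets: (H, W, 2) array, (x, y) offset in [0, 1] relative to the
             patch's top-left corner (assumed convention).
    depths:  (H, W) array of depths along the optical axis (assumed metric).
    K:       (3, 3) pinhole intrinsics (assumed known).
    Returns an (H * W, 3) array of points in camera coordinates.
    """
    h, w = depths.shape
    # Patch column and row indices, each of shape (h, w).
    gx, gy = np.meshgrid(np.arange(w), np.arange(h))
    # Absolute pixel position = (patch index + fractional offset) * patch size.
    u = (gx + offsets[..., 0]) * patch
    v = (gy + offsets[..., 1]) * patch
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    # Rays with unit z: K^{-1} [u, v, 1]^T, then scale by depth.
    rays = pix @ np.linalg.inv(K).T
    return rays * depths.reshape(-1, 1)

# Toy example: a 2 x 3 grid of patches, keypoints at patch centres, 2 m away.
K = np.array([
    [500.0,   0.0, 320.0],
    [  0.0, 500.0, 240.0],
    [  0.0,   0.0,   1.0],
])
offsets = np.full((2, 3, 2), 0.5)
depths = np.full((2, 3), 2.0)

points = backproject_patches(offsets, depths, K)

# Re-projecting with K recovers the pixel coordinates of each keypoint.
reproj = points @ K.T
reproj = reproj[:, :2] / reproj[:, 2:3]
print(points.shape)
print(reproj[0])
```

Under these assumptions, the offset lets a keypoint sit anywhere inside its patch rather than on the coarse patch grid, while the depth fixes its distance along the viewing ray.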