LFM-3D: Learnable Feature Matching Across Wide Baselines Using 3D Signals
Arjun Karpur, Guilherme Perrotta, Ricardo Martin-Brualla, Howard Zhou, André Araujo
TL;DR
LFM-3D tackles wide-baseline local feature matching by augmenting a graph neural network matcher with per-keypoint 3D signals, using either class-specific normalized object coordinates (NOCS) or monocular depth estimates (MDE). A Fourier-based 3D positional encoding lets the matcher make effective use of these low-dimensional 3D cues, producing 3D-infused embeddings that improve correspondence quality. The method is trained in two stages: 2D-only pretraining on synthetic data, followed by joint finetuning with 3D signals. Compared to 2D-only methods, it achieves gains of up to +6% total recall and +28% precision at fixed recall, and improves relative pose estimation accuracy on in-the-wild image pairs by up to 8.6%. These results show that incorporating estimated 3D signals into learnable matchers improves robustness under wide baselines, with practical benefits for 3D reconstruction and pose estimation across diverse object categories.
Abstract
Finding localized correspondences across different images of the same object is crucial to understanding its geometry. In recent years, this problem has seen remarkable progress with the advent of deep learning-based local image features and learnable matchers. Still, learnable matchers often underperform when there exist only small regions of co-visibility between image pairs (i.e., wide camera baselines). To address this problem, we leverage recent progress in coarse single-view geometry estimation methods. We propose LFM-3D, a Learnable Feature Matching framework that uses models based on graph neural networks and enhances their capabilities by integrating noisy, estimated 3D signals to boost correspondence estimation. When integrating 3D signals into the matcher model, we show that a suitable positional encoding is critical to make effective use of the low-dimensional 3D information. We experiment with two different 3D signals, normalized object coordinates and monocular depth estimates, and evaluate our method on large-scale (synthetic and real) datasets containing object-centric image pairs across wide baselines. We observe strong feature matching improvements compared to 2D-only methods, with up to +6% total recall and +28% precision at fixed recall. Additionally, we demonstrate that the resulting improved correspondences lead to much higher relative pose estimation accuracy for in-the-wild image pairs: up to 8.6% higher than the 2D-only approach.
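To make the role of the positional encoding concrete, the following is a minimal sketch of a Fourier-style encoding that lifts a low-dimensional 3D signal into a higher-dimensional feature vector. It is an illustration only: the function name `fourier_encode_3d`, the parameter `num_bands`, and the frequency schedule (pi times powers of two) are assumptions made here, and the exact formulation used by LFM-3D may differ.

```python
import math


def fourier_encode_3d(point, num_bands=4):
    """Encode one (x, y, z) triple as [x, y, z, sin/cos features...].

    Assumption: frequencies are pi * 2^k for k = 0..num_bands-1.
    This schedule is illustrative, not taken from the paper.
    """
    features = [float(c) for c in point]  # keep the raw coordinates
    for coord in point:
        for k in range(num_bands):
            angle = (2.0 ** k) * math.pi * coord
            features.append(math.sin(angle))
            features.append(math.cos(angle))
    return features


# Hypothetical per-keypoint 3D signal, e.g. a normalized object coordinate.
encoded = fourier_encode_3d((0.25, 0.5, 0.75), num_bands=4)
print(len(encoded))  # 3 raw values + 3 coords * 4 bands * 2 (sin, cos) = 27
```

In a matcher of this kind, such an encoded vector would typically be passed through a small learned projection and combined with each keypoint's 2D descriptor embedding. That wiring is likewise an assumption here, not a description of the authors' architecture.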
