LFM-3D: Learnable Feature Matching Across Wide Baselines Using 3D Signals

Arjun Karpur, Guilherme Perrotta, Ricardo Martin-Brualla, Howard Zhou, André Araujo

TL;DR

LFM-3D tackles wide-baseline local feature matching by augmenting a graph neural network matcher with per-keypoint 3D signals, drawn from either class-specific Normalized Object Coordinates (NOCS) or monocular depth estimates (MDE). A Fourier-based 3D positional encoding lets the matcher make effective use of these low-dimensional 3D cues, producing 3D-infused embeddings that improve correspondence quality. The method is trained in two stages: 2D-only pretraining on synthetic data, followed by joint finetuning with 3D signals. Compared to 2D-only methods, it achieves up to +6% total recall and +28% precision at fixed recall, and it improves relative pose accuracy on in-the-wild image pairs by up to 8.6%. These results indicate that incorporating estimated 3D signals into learnable matchers improves robustness under wide baselines, with practical benefits for 3D reconstruction and pose estimation across diverse object categories.

Abstract

Finding localized correspondences across different images of the same object is crucial to understanding its geometry. In recent years, this problem has seen remarkable progress with the advent of deep learning-based local image features and learnable matchers. Still, learnable matchers often underperform when there exist only small regions of co-visibility between image pairs (i.e., wide camera baselines). To address this problem, we leverage recent progress in coarse single-view geometry estimation methods. We propose LFM-3D, a Learnable Feature Matching framework that uses models based on graph neural networks and enhances their capabilities by integrating noisy, estimated 3D signals to boost correspondence estimation. When integrating 3D signals into the matcher model, we show that a suitable positional encoding is critical to effectively make use of the low-dimensional 3D information. We experiment with two different 3D signals (normalized object coordinates and monocular depth estimates) and evaluate our method on large-scale (synthetic and real) datasets containing object-centric image pairs across wide baselines. We observe strong feature matching improvements compared to 2D-only methods, with up to +6% total recall and +28% precision at fixed recall. Additionally, we demonstrate that the resulting improved correspondences lead to much higher relative posing accuracy for in-the-wild image pairs: up to 8.6% compared to the 2D-only approach.
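To illustrate why a positional encoding helps with low-dimensional 3D inputs, here is a minimal sketch of a Fourier-style encoding applied to a single 3D point. The function name `fourier_encode`, the octave-spaced frequencies, and the choice of four frequency bands are assumptions made for illustration; the text above does not specify the exact formulation used in LFM-3D.

```python
import math
from typing import List, Sequence


def fourier_encode(point: Sequence[float], num_frequencies: int = 4) -> List[float]:
    """Lift a low-dimensional point into sin/cos features.

    Assumed form (not confirmed by the source): for each frequency
    2^k * pi, k = 0..num_frequencies-1, emit sin and cos of every coordinate.
    """
    features: List[float] = []
    for k in range(num_frequencies):
        freq = (2.0 ** k) * math.pi
        for value in point:
            features.append(math.sin(freq * value))
            features.append(math.cos(freq * value))
    return features


# Hypothetical NOCS coordinate for one keypoint (NOCS values lie in [0, 1]).
nocs_point = [0.25, 0.5, 0.75]
encoded = fourier_encode(nocs_point, num_frequencies=4)

# 3 coordinates * 2 functions (sin, cos) * 4 frequencies = 24 features
print(len(encoded))
```

The encoded vector is much higher-dimensional than the raw 3D point, which is one plausible reason such an encoding makes the 3D signal easier to combine with high-dimensional local feature descriptors.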

Paper Structure (18 sections, 3 equations, 7 figures, 4 tables)

Figures (7)

  • Figure 1: We propose LFM-3D, a novel learnable method for local feature matching leveraging 3D information. Infusing local feature matching with 3D signals enables accurate estimation of correspondences across very wide baselines, where conventional methods (SIFT + ratio test; Lowe, 2004) and even recent ones (SuperPoint + SuperGlue; Sarlin et al., 2020) fail. Correct matches are shown as green lines and incorrect ones as red lines. Here, our method incorporates 3D normalized object coordinates as part of a graph neural network matcher, which significantly boosts the feature association process.
  • Figure 2: Block diagram of an instance of the proposed LFM-3D system, which uses NOCS (Wang et al., 2019) maps for 3D signals. We extract local features and normalized object coordinates (NOCS) from each image. The NOCS maps are visualized by mapping XYZ to RGB. The NOCS 3D coordinates undergo positional encoding and are combined with local features in order to generate 3D-infused local feature embeddings. A graph neural network is then applied on these to propose correspondences. Our method can find correspondences across images under very wide baselines, thanks to the 3D information leveraged from NOCS. Besides NOCS, in this work we also instantiate the LFM-3D model with monocular depth estimates (MDE), which would follow the same process by changing NOCS maps to MDE maps.
  • Figure 3: Qualitative results of our trained NOCS model on Objectron. Top: input image; middle: segmentation from an off-the-shelf instance segmenter; bottom: NOCS rendering with an axis-aligned grid overlaid. Note that the NOCS model was trained only on synthetic renderings from Google Scanned Objects.
  • Figure 4: Correspondence-level precision/recall curves for ablated versions of class-specific LFM-3D on the Google Scanned Objects evaluation datasets.
  • Figure 5: Qualitative results for our LFM-3D method. We show predicted correspondences with confidence threshold $0.2$. An absence of correspondence lines means that the model found no matches. (a) Correct matches are shown in green and incorrect matches ($>3$ pixel error) are shown in red. (b) & (c) Ground truth isn't available, so we show matches in randomized colors.
  • ...and 2 more figures
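
The captions for Figures 4 and 5 refer to correspondence-level precision/recall, a confidence threshold of 0.2, and a 3-pixel error criterion for incorrect matches. The sketch below shows one way these quantities could be computed for a single image pair. The function name `precision_recall`, the `Match` tuple layout, and the exact definitions of precision and recall are assumptions for illustration, not the paper's evaluation code.

```python
import math
from typing import List, Optional, Tuple

Point = Tuple[float, float]
# Hypothetical layout: (predicted location in image B, confidence,
# ground-truth location in image B, or None if no true correspondence exists)
Match = Tuple[Point, float, Optional[Point]]


def precision_recall(
    matches: List[Match],
    num_gt: int,
    conf_threshold: float = 0.2,
    pixel_threshold: float = 3.0,
) -> Tuple[float, float]:
    """Correspondence-level precision and recall at one confidence threshold."""
    # Keep only matches the model is confident enough about.
    kept = [m for m in matches if m[1] >= conf_threshold]
    # A kept match is correct if it lands within pixel_threshold of ground truth.
    correct = sum(
        1
        for pred, _, gt in kept
        if gt is not None and math.dist(pred, gt) <= pixel_threshold
    )
    precision = correct / len(kept) if kept else 0.0
    recall = correct / num_gt if num_gt else 0.0
    return precision, recall


# Toy data: five predicted matches for one image pair.
matches: List[Match] = [
    ((10.0, 10.0), 0.9, (11.0, 10.0)),  # 1 px error: correct
    ((50.0, 40.0), 0.6, (58.0, 40.0)),  # 8 px error: incorrect
    ((5.0, 5.0), 0.1, (5.0, 5.0)),      # below confidence threshold: dropped
    ((70.0, 20.0), 0.3, None),          # no true correspondence: incorrect
    ((30.0, 30.0), 0.8, (30.0, 32.0)),  # 2 px error: correct
]

# Assume 5 ground-truth correspondences exist for this pair.
precision, recall = precision_recall(matches, num_gt=5)
print(precision, recall)
```

Sweeping `conf_threshold` from 0 to 1 and recording each (recall, precision) pair would trace out a curve of the kind described for Figure 4.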