TP3M: Transformer-based Pseudo 3D Image Matching with Reference Image

Liming Han, Zhaoxiang Liu, Shiguo Lian

TL;DR

TP3M tackles robust image matching under challenging conditions by introducing a reference image to enrich source 2D features into pseudo 3D representations and performing coarse-to-fine 3D matching with Transformer modules. The method integrates 2D edge feature detection, 2D feature matching, pseudo 3D feature extraction, and pseudo 3D matching within a ViT framework, supervised by combined losses that leverage Canny edges and SfM ground truth. Experimental results across HPatches, ScanNet, MegaDepth, Aachen Day-Night, and InLoc show that TP3M achieves state-of-the-art performance in homography estimation, pose estimation, and visual localization, especially in visually challenging scenes. The findings highlight the value of geometry-aware 3D features derived from a reference view for improving cross-view correspondence and downstream localization tasks.

Abstract

Image matching remains challenging in scenes with large viewpoint or illumination changes, or with low texture. In this paper, we propose a Transformer-based pseudo 3D image matching method. It upgrades the 2D features extracted from the source image to 3D features with the help of a reference image, and matches them to the 2D features extracted from the destination image through coarse-to-fine 3D matching. Our key finding is that introducing the reference image screens the source image's fine points and further enriches their feature descriptors from 2D to 3D, which improves matching performance with the destination image. Experimental results on multiple datasets show that the proposed method achieves state-of-the-art performance on homography estimation, pose estimation, and visual localization, especially in challenging scenes.
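
The data flow described in the abstract (screen source points against a reference image, enrich the surviving descriptors, then match to the destination) can be illustrated with a toy sketch. This is not the authors' implementation: TP3M uses Transformer attention modules, whereas the sketch below substitutes plain mutual-nearest-neighbor matching and a simple descriptor average as stand-ins. All function names and the example descriptors are hypothetical.

```python
# Toy illustration only: mutual nearest neighbors and descriptor averaging
# are assumed stand-ins for the Transformer modules described in the paper.

def dot(u, v):
    """Dot-product similarity between two descriptors."""
    return sum(a * b for a, b in zip(u, v))


def mutual_nearest_neighbors(desc_a, desc_b):
    """Return (i, j) pairs where desc_a[i] and desc_b[j] pick each other."""
    best_ab = [max(range(len(desc_b)), key=lambda j: dot(a, desc_b[j]))
               for a in desc_a]
    best_ba = [max(range(len(desc_a)), key=lambda i: dot(desc_a[i], b))
               for b in desc_b]
    return [(i, j) for i, j in enumerate(best_ab) if best_ba[j] == i]


def screen_and_enrich(src, ref):
    """Keep source points that have a mutual match in the reference image,
    and blend in the matched reference descriptor (assumed enrichment)."""
    enriched = {}
    for i, j in mutual_nearest_neighbors(src, ref):
        enriched[i] = [(s + r) / 2 for s, r in zip(src[i], ref[j])]
    return enriched


def match_to_destination(enriched, dst):
    """Match the enriched source descriptors to destination descriptors.
    Returns (source_index, destination_index) pairs."""
    src_ids = list(enriched)
    pairs = mutual_nearest_neighbors([enriched[i] for i in src_ids], dst)
    return [(src_ids[a], j) for a, j in pairs]


# Hypothetical 2-dimensional descriptors for three source points,
# two reference points, and two destination points.
src = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.5]]
ref = [[0.9, 0.1], [0.1, 0.9]]
dst = [[0.0, 1.0], [1.0, 0.0]]

enriched = screen_and_enrich(src, ref)
matches = match_to_destination(enriched, dst)
```

In this example the third source point has no mutual match in the reference image, so it is screened out before matching against the destination.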

Paper Structure

This paper contains 18 sections, 9 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Comparison between TP3M and MatchFormer. On challenging image pairs with large viewpoint and illumination changes from the Aachen Day-Night dataset, matching with TP3M yields more accurate poses than MatchFormer (correspondences drawn as red lines represent epipolar error at 0.5 m, 5°).
  • Figure 2: Overview of the proposed TP3M. It includes four key modules: Transformer-based self-attention for 2D edge feature detection; Transformer-based cross-attention for 2D feature matching; pseudo 3D feature extraction; and coarse-to-fine pseudo 3D matching between 2D features and 3D features.
  • Figure 3: 2D edge feature detection with $I_A$ as example.
  • Figure 4: Visualizing attention. A: 2D self-attention in the source image; B: 3D self-attention in the source image; C: 2D self-attention in the destination image; D: 2D-2D cross-attention; E: 2D-3D cross-attention between the source and destination images.