TP3M: Transformer-based Pseudo 3D Image Matching with Reference Image
Liming Han, Zhaoxiang Liu, Shiguo Lian
TL;DR
TP3M tackles robust image matching under challenging conditions by introducing a reference image to enrich source 2D features into pseudo 3D representations and performing coarse-to-fine 3D matching with Transformer modules. The method integrates 2D edge feature detection, 2D feature matching, pseudo 3D feature extraction, and pseudo 3D matching within a ViT framework, supervised by combined losses that leverage Canny edges and SfM ground truth. Experimental results across HPatches, ScanNet, MegaDepth, Aachen Day-Night, and InLoc show that TP3M achieves state-of-the-art performance in homography estimation, pose estimation, and visual localization, especially in visually challenging scenes. The findings highlight the value of geometry-aware 3D features derived from a reference view for improving cross-view correspondence and downstream localization tasks.
Abstract
Image matching remains challenging in scenes with large viewpoint or illumination changes, or with low texture. In this paper, we propose a Transformer-based pseudo 3D image matching method. With the help of a reference image, it upgrades the 2D features extracted from the source image to 3D features, and then matches them to the 2D features extracted from the destination image through coarse-to-fine 3D matching. Our key finding is that introducing the reference image both screens the source image's fine points and enriches their feature descriptors from 2D to 3D, which improves matching performance against the destination image. Experimental results on multiple datasets show that the proposed method achieves state-of-the-art performance on homography estimation, pose estimation, and visual localization, especially in challenging scenes.
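The idea of enriching source descriptors with a reference view before matching them to the destination can be illustrated with a toy sketch. This is not the TP3M architecture: it assumes a single weight-free cross-attention step with a residual connection as a stand-in for the learned Transformer modules, and plain mutual nearest-neighbor matching as a stand-in for the coarse-to-fine stage. The function names `enrich_with_reference` and `mutual_nearest_neighbors` are hypothetical and chosen only for illustration.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v] if n > 0 else list(v)


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def enrich_with_reference(src, ref):
    """Assumed stand-in for the 2D -> pseudo 3D upgrade.

    Each source descriptor attends over all reference descriptors
    (scaled dot-product attention, no learned weights) and the attended
    vector is added back as a residual, then L2-normalized.
    """
    d = len(src[0])
    enriched = []
    for q in src:
        weights = softmax([dot(q, k) / math.sqrt(d) for k in ref])
        attended = [sum(w * k[i] for w, k in zip(weights, ref)) for i in range(d)]
        enriched.append(normalize([qi + ai for qi, ai in zip(q, attended)]))
    return enriched


def mutual_nearest_neighbors(a, b):
    """Return index pairs (i, j) where a[i] and b[j] are each other's best match."""
    sim = [[dot(x, y) for y in b] for x in a]
    best_ab = [max(range(len(b)), key=lambda j: row[j]) for row in sim]
    best_ba = [max(range(len(a)), key=lambda i: sim[i][j]) for j in range(len(b))]
    return [(i, j) for i, j in enumerate(best_ab) if best_ba[j] == i]


# Toy descriptors: the destination lists the same two points in swapped order.
src = [[1.0, 0.0], [0.0, 1.0]]
ref = [[1.0, 0.0], [0.0, 1.0]]
dst = [[0.0, 1.0], [1.0, 0.0]]

matches = mutual_nearest_neighbors(enrich_with_reference(src, ref), dst)
print(matches)
```

In this sketch the enriched descriptors keep the dimensionality of the destination descriptors, so they can be compared directly; how the actual method relates its pseudo 3D features to the destination's 2D features is not specified in the abstract.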
