Are Semi-Dense Detector-Free Methods Good at Matching Local Features?

Matthieu Vilain, Rémi Giraud, Hugo Germain, Guillaume Bourmaud

TL;DR

The paper investigates how the quality of correspondences relates to the accuracy of pose estimation in semi-dense detector-free (SDF) methods, and introduces SAM, a structured attention-based image matching architecture. SAM uses a latent-vector space and structured cross- and self-attention to predict dense correspondences, followed by a refinement stage to reach full resolution. Across MegaDepth, HPatches, and ETH3D, SAM achieves competitive or superior pose/homography estimation compared with SDF methods, yet shows lower overall matching accuracy. When the evaluation is restricted to textured regions (MA_text), SAM often comes out ahead. The results reveal a strong correlation between accurate correspondences in textured regions and pose accuracy, which suggests that region-aware evaluation is important for understanding and improving pose estimation in local-feature matching. The authors state that their code will be released.

Abstract

Semi-dense detector-free approaches (SDF), such as LoFTR, are currently among the most popular image matching methods. While SDF methods are trained to establish correspondences between two images, their performance is almost exclusively evaluated using relative pose estimation metrics. As a result, the link between their ability to establish correspondences and the quality of the resulting estimated pose has so far received little attention. This paper is a first attempt to study this link. We start by proposing a novel structured attention-based image matching architecture (SAM). It allows us to show a counter-intuitive result on two datasets (MegaDepth and HPatches): on the one hand, SAM either outperforms or is on par with SDF methods in terms of pose/homography estimation metrics; on the other hand, SDF approaches are significantly better than SAM in terms of matching accuracy. We then propose to limit the computation of the matching accuracy to textured regions, and show that in this case SAM often surpasses SDF methods. Our findings highlight a strong correlation between the ability to establish accurate correspondences in textured regions and the accuracy of the resulting estimated pose/homography. Our code will be made available.
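
To make the two metrics concrete, here is a minimal sketch of a matching accuracy at a pixel threshold, with an optional mask that restricts the computation to textured locations. It is an illustration under stated assumptions, not the authors' evaluation code: the function name `matching_accuracy`, the use of `<=` for the threshold test, and the toy data are all hypothetical, and the textured mask is taken as an input because the text above does not say how textured regions are detected.

```python
import math


def matching_accuracy(pred, gt, threshold=2.0, mask=None):
    """Fraction of query locations whose predicted correspondent lies within
    `threshold` pixels of the ground-truth correspondent.

    pred: list of (x, y) predicted correspondents in the target image.
    gt:   list of (x, y) ground-truth correspondents, or None when no
          ground truth is available for that query (such queries are skipped).
    mask: optional list of bools; when given, only queries flagged True are
          scored (e.g. a hypothetical "textured" flag per query location).
    """
    hits = 0
    total = 0
    for i, (p, g) in enumerate(zip(pred, gt)):
        if g is None:
            continue  # no ground truth: not counted
        if mask is not None and not mask[i]:
            continue  # outside the selected region: not counted
        total += 1
        # Assumption: a match is "accurate" when the error is <= threshold.
        if math.hypot(p[0] - g[0], p[1] - g[1]) <= threshold:
            hits += 1
    return hits / total if total else float("nan")


# Toy data (made up for illustration only).
pred = [(10.0, 10.0), (20.0, 21.0), (35.0, 30.0), (50.0, 58.0), (5.0, 5.0)]
gt = [(10.5, 10.0), (20.0, 20.0), (30.0, 30.0), (50.0, 50.0), None]
textured = [True, True, False, False, True]

ma = matching_accuracy(pred, gt, threshold=2.0)                  # all valid queries
ma_text = matching_accuracy(pred, gt, threshold=2.0, mask=textured)  # textured only
print(ma, ma_text)
```

In this toy case the two large errors fall on locations flagged as uniform, so the accuracy over all valid queries is lower than the accuracy over textured queries. This is the kind of gap between MA and MA_text that the abstract describes.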

Paper Structure (20 sections, 5 equations, 9 figures, 4 tables)

This paper contains 20 sections, 5 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Given query locations within textured regions of the source image (left), we show their predicted correspondents in the target image (right) for: (Top row) SAM (Proposed) - Structured Attention-based image Matching, (Bottom row) LoFTR+QuadTree [sun2021loftr, tang2022quadtree] - a semi-dense detector-free approach (line colors indicate the distance in pixels with respect to the ground truth correspondent). We report: (MA@2) - the matching accuracy at 2 pixels computed on all the semi-dense locations of the source image with available ground truth correspondent (which includes both textured and uniform regions), (MA$_\text{text}$@2) - the matching accuracy at 2 pixels computed on all the textured semi-dense locations of the source image with available ground truth correspondent (i.e., uniform regions are ignored), (errR and errT) - the relative pose error. SAM has a better pose estimation but a lower matching accuracy (MA@2), which seems counter-intuitive. However, if we consider only textured regions (MA$_\text{text}$@2), then SAM outperforms LoFTR+QuadTree.
  • Figure 2: Overview of the proposed Structured Attention-based image Matching (SAM) method. (a) The matching architecture first extracts features from both source and target images at resolution 1/4. Then it uses a set of learned latent vectors alongside the descriptors of the query locations, and performs an input structured cross-attention with the dense features of the target. The latent space is then processed through a succession of structured self-attention layers. An output structured cross-attention is applied to update the target features with the information from the latent space. Finally, the correspondence maps are obtained using a dot product. (b) Proposed structured attention layer. See text for details.
  • Figure 3: Visualization of the learned latent vectors of SAM. The average query map is obtained by averaging 64 correspondence maps of 64 query locations (Red crosses) while the average latent map is obtained by averaging the 128 correspondence maps of the 128 learned latent vectors. We observe that the average query map is mainly activated around the correspondents, whereas these regions are less activated in the average latent map.
  • Figure 4: Visualization - Structured attention. The visio-positional and positional maps are computed before the output cross-attention. Red crosses represent the ground-truth correspondences. Blue and Green crosses are the maxima of the visio-positional and positional maps, respectively. One can see that the visio-positional maps are highly multimodal (i.e., sensitive to repetitive structures) while the positional maps are almost unimodal.
  • Figure 5: Qualitative results on MegaDepth1500. For each image pair: (top row) Visualization of established correspondences used to compute the MA, (bottom row) Visualization of established correspondences used to compute the MA$_\text{text}$. Line colors indicate the distance in pixels with respect to the ground truth correspondent.
  • ...and 4 more figures
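
The captions of Figures 2 to 4 describe correspondence maps obtained with a dot product, maxima of such maps, and averages over several maps. The sketch below illustrates those three operations on a tiny feature grid. It is a simplified illustration, not the SAM architecture: the attention layers and latent vectors are omitted, the softmax normalisation is an assumption (the caption only mentions a dot product), and the names `correspondence_map`, `argmax_location`, and `average_maps` are hypothetical.

```python
import math


def correspondence_map(query_desc, target_feats):
    """Dot product between one query descriptor and every target feature,
    followed by a softmax over all target locations (assumed normalisation).

    query_desc:   list of floats (descriptor of one query location).
    target_feats: H x W grid (list of rows) of descriptors of the same size.
    Returns an H x W grid of non-negative scores that sum to 1.
    """
    scores = [[sum(q * t for q, t in zip(query_desc, feat)) for feat in row]
              for row in target_feats]
    peak = max(max(row) for row in scores)  # subtract for numerical stability
    exps = [[math.exp(s - peak) for s in row] for row in scores]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]


def argmax_location(cmap):
    """(x, y) position of the maximum of a correspondence map."""
    coords = [(x, y) for y in range(len(cmap)) for x in range(len(cmap[0]))]
    return max(coords, key=lambda c: cmap[c[1]][c[0]])


def average_maps(maps):
    """Element-wise mean of several maps of identical shape."""
    n = len(maps)
    return [[sum(m[y][x] for m in maps) / n for x in range(len(maps[0][0]))]
            for y in range(len(maps[0]))]


# Toy 2 x 3 target grid with 2-D descriptors (made up for illustration).
target_feats = [
    [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]],
]
cmap = correspondence_map([1.0, 0.0], target_feats)
best = argmax_location(cmap)
print(best)
```

A map with several peaks of similar height corresponds to what the Figure 4 caption calls multimodal, while a map dominated by a single peak is close to unimodal. Averaging maps over queries or latent vectors, as in Figure 3, can be done with `average_maps`.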