MESA: Matching Everything by Segmenting Anything

Yesheng Zhang, Xu Zhao

TL;DR

MESA addresses feature-matching redundancy by restricting dense comparisons to SAM-segmented image areas and formalizing area matching on a new multi-relational Area Graph. From this graph it derives an Area Markov Random Field (AMRF) and an Area Bayesian Network (ABN), solves area matching via Graph Cut, and refines the result with a global energy. This yields precise area correspondences and improves downstream pose estimation in both indoor and outdoor tasks. The gains hold across semi-dense and dense matchers (e.g., up to +15.3% AUC@5° indoor, and +13.6% for DKM in indoor pose estimation), supporting the practical value of area-level matching. Although MESA increases runtime, its performance is robust, and SAM feature distillation and parallelization are clear directions for speeding it up, making area-aware matching a viable pre-step for high-precision visual localization and navigation.

Abstract

Feature matching is a crucial task in computer vision that involves finding correspondences between images. Previous studies achieve remarkable performance using learning-based feature comparison. However, the pervasive matching redundancy between images gives rise to unnecessary and error-prone computations in these methods, limiting their accuracy. To address this issue, we propose MESA, a novel approach that establishes precise area (or region) matches to reduce matching redundancy efficiently. MESA first leverages the advanced image understanding capability of SAM, a state-of-the-art foundation model for image segmentation, to obtain image areas with implicit semantics. Then, a multi-relational graph is proposed to model the spatial structure of these areas and construct their scale hierarchy. Based on graphical models derived from the graph, area matching is reformulated as an energy minimization task and effectively resolved. Extensive experiments demonstrate that MESA yields substantial precision improvements for multiple point matchers in indoor and outdoor downstream tasks, e.g., +13.61% for DKM in indoor pose estimation.

Paper Structure

This paper contains 30 sections, 16 equations, 7 figures, 9 tables.

Figures (7)

  • Figure 1: The matching redundancy reduction in MESA. High-level image understanding enables efficient matching redundancy reduction, allowing for precise point matching by dense feature comparison. Therefore, MESA effectively reduces the matching redundancy by area matching based on SAM segmentation, significantly improving the accuracy of DKM.
  • Figure 2: Overview of MESA. Based on ❶ SAM segmentation, we first construct ❷ Area Graphs. Each graph is then converted into two graphical models according to its two edge types. Through the ❸ Area Markov Random Field, area matching is formulated as an ❹ Energy Minimization. Then, leveraging the ❺ Area Bayesian Network and our ❺ Learning Area Similarity Calculation, the ❻ Graph Energy can be efficiently calculated, so that ❼ Graph Cut can be used to obtain putative area matches. Finally, ❽ Global Energy Minimization determines the best area match, which serves as the input to the subsequent point matcher for precise feature matching, following the ❾ Area to Point Matching framework (SGAM).
  • Figure 3: The proposed Area Graph. The graph nodes (circles with masks representing rectangular areas) include both areas from SAM results (white boundaries) and areas from our graph completion algorithm (black boundaries). They are divided into four levels according to their sizes. The adjacency edges (dashed lines) and inclusion edges (arrows) connect all nodes. Only adjacency edges within the same level are shown for clarity.
  • Figure 4: Learning area similarity. The area similarity calculation is formulated as patch-level classification. We predict the probability of each patch in one area appearing in the other to construct activity maps. The similarity is obtained as the product of the activity expectations, contributing to our exact area matching.
  • Figure 5: Qualitative comparison of Global Energy Refinement. Because $E_G$ considers the AG structures of both images, objects with the same appearance, which are mismatched by $\arg\min E_{self}$, can be distinguished according to their neighbors, revealing the robustness of $\arg\min E_G$ under repetitive patterns.
  • ...and 2 more figures
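
The Figure 4 caption describes area similarity as the product of the activity expectations of two activity maps. A minimal sketch of that one step follows. It assumes each activity map is a 2D grid of per-patch probabilities in [0, 1] and that the "expectation" is the plain mean over patches; the paper may define it differently. The function names `activity_expectation` and `area_similarity` are hypothetical and do not come from the MESA code.

```python
from statistics import fmean


def activity_expectation(activity_map):
    """Return the mean per-patch probability of a 2D activity map.

    Assumption: the expectation is an unweighted mean over all patches.
    """
    values = [p for row in activity_map for p in row]
    if not values:
        raise ValueError("activity map is empty")
    if any(p < 0.0 or p > 1.0 for p in values):
        raise ValueError("patch probabilities must lie in [0, 1]")
    return fmean(values)


def area_similarity(activity_a_in_b, activity_b_in_a):
    """Combine the two directional activity maps into one similarity score.

    activity_a_in_b: probability of each patch of area A appearing in area B.
    activity_b_in_a: probability of each patch of area B appearing in area A.
    The score is the product of the two expectations, so it is high only
    when both areas are largely visible in each other.
    """
    return activity_expectation(activity_a_in_b) * activity_expectation(activity_b_in_a)


# Toy 2x2 activity maps (illustrative values, not data from the paper).
map_a_in_b = [[1.0, 0.5], [0.5, 0.0]]   # mean 0.5
map_b_in_a = [[1.0, 1.0], [0.5, 0.5]]   # mean 0.75
score = area_similarity(map_a_in_b, map_b_in_a)  # 0.5 * 0.75
```

Taking a product rather than a sum means a pair scores low whenever either direction has low activity, which penalizes one-sided overlaps such as a small area fully contained in a much larger one.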