CoMatch: Dynamic Covisibility-Aware Transformer for Bilateral Subpixel-Level Semi-Dense Image Matching

Zizhuo Li, Yifan Lu, Linfeng Tang, Shihua Zhang, Jiayi Ma

TL;DR

CoMatch addresses the efficiency-accuracy gap in semi-dense image matching by introducing a dynamic covisibility-aware Transformer that selectively condenses and attends over tokens based on covisibility. The method couples a covisibility-guided token condenser with a covisibility-assisted attention to robustly propagate context only from covisible regions, plus a bilateral subpixel refinement stage that optimizes correspondences in both views. Through coarse-to-fine matching with a two-stage bilateral refinement and joint supervision, CoMatch achieves state-of-the-art performance on pose estimation, homography, and visual localization while maintaining competitive speed. The approach demonstrates strong cross-dataset generalization and substantial practical impact for SLAM, SfM, and localization tasks.

Abstract

This study proposes CoMatch, a novel semi-dense image matcher with dynamic covisibility awareness and bilateral subpixel accuracy. First, observing that modeling context interaction over the entire coarse feature map incurs highly redundant computation because neighboring tokens have similar representations, we introduce a covisibility-guided token condenser that adaptively aggregates tokens according to their dynamically estimated covisibility scores, which improves computational efficiency while also improving the representational capacity of the aggregated tokens. Second, because feature interaction with large non-covisible areas is distracting and may degrade feature distinctiveness, we deploy a covisibility-assisted attention mechanism that selectively suppresses irrelevant messages broadcast from non-covisible reduced tokens, yielding robust and compact attention over relevant tokens rather than all of them. Third, we find that at the fine-level stage, current methods adjust only the target view's keypoints to subpixel level, while those in the source view remain at the coarse level and are thus not informative enough, which is detrimental to applications sensitive to keypoint location. We develop a simple yet potent fine correlation module that refines the matching candidates in both source and target views to subpixel level, yielding a notable performance improvement. Extensive experiments across an array of public benchmarks affirm CoMatch's promising accuracy, efficiency, and generalizability.
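
The token-condensing idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that condensing means pooling non-overlapping windows of coarse tokens, with each token weighted by its covisibility score. The function name `condense_tokens`, the `window` parameter, and the normalization scheme are all hypothetical choices made for illustration.

```python
import numpy as np

def condense_tokens(features, covis, window=2):
    """Pool each window x window block of tokens into one token.

    features: (H, W, C) coarse feature map.
    covis:    (H, W) covisibility scores in [0, 1] (assumed to be given).
    Returns an (H // window, W // window, C) condensed feature map.
    Illustrative assumption: weights are covisibility scores normalized
    within each window, so covisible tokens dominate the pooled token.
    """
    H, W, C = features.shape
    h, w = H // window, W // window
    # Split the map into blocks: axes 1 and 3 index positions inside a block.
    f = features[:h * window, :w * window].reshape(h, window, w, window, C)
    s = covis[:h * window, :w * window].reshape(h, window, w, window, 1)
    # Normalize scores per block; the epsilon avoids division by zero
    # when a whole block is predicted as non-covisible.
    weights = s / (s.sum(axis=(1, 3), keepdims=True) + 1e-8)
    return (f * weights).sum(axis=(1, 3))

# Usage: a 4x4 map with 3 channels is reduced to 2x2 tokens.
rng = np.random.default_rng(0)
condensed = condense_tokens(rng.normal(size=(4, 4, 3)), rng.uniform(size=(4, 4)))
print(condensed.shape)
```

With uniform scores this reduces to plain average pooling; with one dominant score per window, the condensed token is close to that single covisible token. Attention over the condensed map then runs on `window**2` times fewer tokens.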

Paper Structure

This paper contains 33 sections, 11 equations, 6 figures, 10 tables.

Figures (6)

  • Figure 1: Image matching accuracy and efficiency on MegaDepth. CoMatch achieves remarkably better accuracy than both sparse and semi-dense matchers with a commendable speed. Compared with dense matcher RoMa, our method is $\sim6\times$ faster with comparable performance.
  • Figure 2: Visualization of covisibility prediction. We first bilinearly up-sample the covisibility score map to match the original image resolution, and then multiply it with the input image.
  • Figure 3: Pipeline overview. (1) Given a pair of images, a CNN network extracts coarse features $^{(0)}\mathbf{F}^{\mathbf{A}}$ and $^{(0)}\mathbf{F}^{\mathbf{B}}$, alongside fine ones. (2) Dynamic covisibility-aware Transformer is stacked $L$ times to conduct efficient, robust, and compact context interaction for coarse feature transformation. (3) Transformed coarse features are correlated, followed by a dual-softmax (DS) operation to yield the assignment matrix $\mathcal{S}$, where mutual nearest neighbor (MNN) matching is used to establish coarse matches $\mathcal{M}_c$. (4) Fine distinctive features $\widehat{\mathbf{F}}^\mathbf{A}$ and $\widehat{\mathbf{F}}^\mathbf{B}$ at the original resolution are derived by progressively fusing $^{(L)}\mathbf{F}^{\mathbf{A}}$ and $^{(L)}\mathbf{F}^{\mathbf{B}}$ with backbone features at $1/4$ and $1/2$ resolutions. Later, feature patches centered on $\mathcal{M}_c$ are cropped, followed by a two-stage refinement to produce fine matches $\mathcal{M}_f$ with bilateral subpixel accuracy.
  • Figure 4: Visualization of matching results on MegaDepth and ScanNet. A match is marked as correct if its epipolar error is below $1\times10^{-4}$ for MegaDepth and $5\times10^{-4}$ for ScanNet, and as incorrect otherwise.
  • Figure 5: Visualization of covisibility prediction. We first bilinearly up-sample the covisibility score map to match the original image resolution, and then multiply it with the input image.
  • ...and 1 more figure
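
Step (3) of the Figure 3 caption, dual-softmax followed by mutual nearest neighbor matching, can be sketched as below. This is a minimal illustration under stated assumptions, not the paper's code: the function name `dual_softmax_mnn` and the `temperature` and `threshold` values are hypothetical, and the coarse features are assumed to be flattened to one row per token.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dual_softmax_mnn(feat_a, feat_b, temperature=0.1, threshold=0.2):
    """Return the assignment matrix S and coarse matches M_c.

    feat_a: (N, C) flattened coarse features of image A.
    feat_b: (M, C) flattened coarse features of image B.
    temperature and threshold are illustrative values, not from the source.
    """
    # Correlate every token in A with every token in B.
    sim = feat_a @ feat_b.T / temperature
    # Dual-softmax: product of the row-wise and column-wise softmax.
    S = softmax(sim, axis=1) * softmax(sim, axis=0)
    # Mutual nearest neighbors: (i, j) is kept only if j is the best
    # column for row i AND i is the best row for column j.
    row_best = S.argmax(axis=1)
    col_best = S.argmax(axis=0)
    matches = [
        (i, int(j))
        for i, j in enumerate(row_best)
        if col_best[j] == i and S[i, j] > threshold
    ]
    return S, matches

# Usage: features of B are a permutation of the features of A.
feat_a = np.eye(3)
feat_b = np.eye(3)[[2, 0, 1]]
S, matches = dual_softmax_mnn(feat_a, feat_b)
print(matches)
```

The mutual check discards one-sided matches, and the confidence threshold removes pairs whose assignment score is low in either direction.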