A Detector-oblivious Multi-arm Network for Keypoint Matching

Xuelun Shen, Qian Hu, Xin Li, Cheng Wang

TL;DR

This work tackles robust, detector-agnostic keypoint matching by introducing a detector-oblivious Multi-Arm Network (MAN) that leverages region-level cues. By incorporating two auxiliary tasks—Overlap Estimation (OVE) and Depth Region Estimation (DEE)—the network learns region-aware features across two images and fuses them via three parallel arms to produce multiple similarity matrices. The detector-oblivious description path ensures compatibility with arbitrary keypoint detectors, enabling retraining-free deployment. Empirical results across outdoor and indoor datasets demonstrate state-of-the-art or competitive performance in pose estimation and visual localization, with ablations illustrating the value of auxiliary tasks and multi-arm design. The approach offers improved reliability in challenging conditions while maintaining practical inference efficiency.
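To make the notion of a similarity matrix concrete, here is a minimal sketch of how one such matrix between two sets of keypoint descriptors could be computed and turned into matches. It is an illustration under stated assumptions, not the authors' implementation: cosine similarity, the mutual-nearest-neighbour rule, and the function names `similarity_matrix` and `mutual_nearest_matches` are all hypothetical choices, and the sketch does not cover how MAN fuses its three arms.

```python
import numpy as np


def similarity_matrix(desc_a, desc_b):
    """Cosine similarity between every descriptor in A and every one in B.

    desc_a: (N, D) array, desc_b: (M, D) array. Returns an (N, M) array.
    Assumption: cosine similarity stands in for whatever score MAN uses.
    """
    desc_a = np.asarray(desc_a, dtype=np.float64)
    desc_b = np.asarray(desc_b, dtype=np.float64)
    a = desc_a / np.linalg.norm(desc_a, axis=1, keepdims=True)
    b = desc_b / np.linalg.norm(desc_b, axis=1, keepdims=True)
    return a @ b.T


def mutual_nearest_matches(sim):
    """Keep pairs (i, j) where j is i's best match and i is j's best match.

    Assumption: a simple mutual-nearest-neighbour rule, used here only
    to show how a similarity matrix can be read as a set of matches.
    """
    best_b = sim.argmax(axis=1)  # best column (keypoint in B) for each row
    best_a = sim.argmax(axis=0)  # best row (keypoint in A) for each column
    idx_a = np.arange(sim.shape[0])
    keep = best_a[best_b] == idx_a
    return np.stack([idx_a[keep], best_b[keep]], axis=1)


# Toy descriptors: 3 keypoints in image A, 2 keypoints in image B.
desc_a = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
desc_b = np.array([[0.0, 2.0], [3.0, 0.0]])
sim = similarity_matrix(desc_a, desc_b)
matches = mutual_nearest_matches(sim)
```

Because the descriptors enter only as arrays, nothing in this step depends on which detector produced the keypoints, which is the property the detector-oblivious design relies on.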

Abstract

This paper presents a matching network that establishes point correspondences between images. We propose a Multi-Arm Network (MAN) that learns region overlap and depth, which greatly improves keypoint matching robustness while adding little computational cost at inference. Unlike many existing learning-based pipelines, which require re-training whenever a different keypoint detector is adopted, our network works directly with different keypoint detectors without this time-consuming re-training. Comprehensive experiments on outdoor and indoor datasets demonstrate that the proposed MAN outperforms state-of-the-art methods.

Paper Structure

This paper contains 18 sections, 9 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: When images undergo drastic viewpoint and scale changes, the state-of-the-art pipeline SuperGlue [GLUE2020] fails to produce reliable matches, whereas our MAN finds the correspondences more effectively. One example is illustrated here; red and green lines indicate incorrect and correct matches, respectively.
  • Figure 2: Architecture of the proposed detector-oblivious multi-arm network (MAN). The functions $\phi$ and $\psi$ represent the auxiliary networks for overlap estimation and depth region estimation, respectively. Symbols in red and blue denote operations whose target images are A and B, respectively.
  • Figure 3: Architecture of SuperGlue [GLUE2020], where $\mathbf{d}$, $\mathbf{p}$, and $\mathbf{c}$ represent keypoint descriptions, coordinates, and confidence, respectively.
  • Figure 4: Data visualization for the depth region estimation task. Consider a pair of images $\mathbf{A}$ and $\mathbf{B}$ and their corresponding depth maps $\mathbf{D}^{A}$ and $\mathbf{D}^{B}$. Areas with invalid depth are shown in black. We divide each image into $C=3$ regions according to depth values: two regions cover the valid depth (blue and yellow) and one covers the invalid depth (black).
  • Figure 5: Qualitative image matches. We compare MAN and SuperGlue [GLUE2020] in indoor and outdoor environments. Red and green lines indicate incorrect and correct matches, respectively.
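
The depth region labelling described in Figure 4 can be sketched as follows. This is a minimal illustration, not the authors' implementation. The caption states only that the image is divided into $C=3$ regions by depth value, with one region reserved for invalid depth. The function name `depth_region_labels`, the convention that invalid depth is stored as a non-positive or non-finite value, and the use of quantile thresholds over the valid depths are all assumptions.

```python
import numpy as np


def depth_region_labels(depth, num_regions=3):
    """Assign each pixel of a depth map to one of `num_regions` regions.

    Label 0 marks invalid depth; labels 1..num_regions-1 split the valid
    depths from near to far.
    Assumptions (not from the paper): invalid depth is <= 0 or non-finite,
    and the valid depths are split at equal-population quantiles.
    """
    if num_regions < 2:
        raise ValueError("need one invalid region and at least one valid region")

    depth = np.asarray(depth, dtype=np.float64)
    valid = np.isfinite(depth) & (depth > 0)
    labels = np.zeros(depth.shape, dtype=np.int64)  # 0 = invalid region
    if not valid.any():
        return labels

    # Interior quantiles that split the valid depths into equal-sized bins.
    num_valid_regions = num_regions - 1
    quantiles = np.linspace(0.0, 1.0, num_valid_regions + 1)[1:-1]
    thresholds = np.quantile(depth[valid], quantiles)

    # searchsorted gives the bin index (0-based) of each valid depth value.
    labels[valid] = 1 + np.searchsorted(thresholds, depth[valid], side="right")
    return labels


# Toy depth map: 0 and NaN are treated as invalid under the assumption above.
toy_depth = np.array([[0.0, 1.0, 2.0],
                      [3.0, 4.0, np.nan]])
toy_labels = depth_region_labels(toy_depth, num_regions=3)
```

With $C=3$, the two valid labels play the role of the blue and yellow regions in Figure 4 and label 0 plays the role of the black region. A fixed metric threshold could replace the quantile split if that better reflects the data.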