MatchDet: A Collaborative Framework for Image Matching and Object Detection

Jinxiang Lai; Wenlong Wu; Bin-Bin Gao; Jun Liu; Jiawei Zhan; Congchong Nie; Yi Zeng; Chengjie Wang

TL;DR

MatchDet presents a novel task-collaborative framework that jointly optimizes image matching and object detection. By integrating a Weighted Attention Module (WAM), a Weighted Spatial Attention Module (WSAM), and a Box Filter, the approach enhances feature interaction between paired images and foreground regions, leading to mutual improvements in both tasks. Empirical results on Warp-COCO and miniScanNet demonstrate substantial gains in AP for detection and AUC for matching, validating the effectiveness of collaborative learning and the proposed modules. This framework offers a practical, single-model solution for applications requiring simultaneous correspondence estimation and object localization, with clear pathways for extending to different detectors and matchers.

Abstract

Image matching and object detection are two fundamental and challenging tasks, yet many related applications treat them as two separate tasks (i.e. task-individual). In this paper, a collaborative framework called MatchDet (i.e. task-collaborative) is proposed for image matching and object detection to obtain mutual improvements. To achieve collaborative learning of the two tasks, we propose three novel modules: a Weighted Spatial Attention Module (WSAM) for the Detector, and a Weighted Attention Module (WAM) and Box Filter for the Matcher. Specifically, the WSAM highlights the foreground regions of the target image to benefit the subsequent detector, the WAM enhances the connection between the foreground regions of the paired images to ensure high-quality matches, and the Box Filter mitigates the impact of false matches. We evaluate the approaches on a new benchmark with two datasets called Warp-COCO and miniScanNet. Experimental results show that our approaches are effective and achieve competitive improvements.
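To make the role of the Box Filter more concrete, here is a minimal sketch of one plausible reading: a match is kept only if its point in the target image falls inside at least one detected bounding box. The abstract does not give the exact rule, so the function names, the `(x_min, y_min, x_max, y_max)` box layout, and the inside-the-box criterion are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # assumed layout: (x_min, y_min, x_max, y_max)
Match = Tuple[Point, Point]  # (point in reference image, point in target image)


def inside_any_box(point: Point, boxes: List[Box]) -> bool:
    """Return True if the point lies inside at least one box (borders included)."""
    x, y = point
    return any(x0 <= x <= x1 and y0 <= y <= y1 for x0, y0, x1, y1 in boxes)


def box_filter(matches: List[Match], target_boxes: List[Box]) -> List[Match]:
    """Hypothetical filter: drop matches whose target point is outside every box."""
    return [m for m in matches if inside_any_box(m[1], target_boxes)]


# Two candidate matches; only the first lands inside the detected box.
matches = [((10.0, 12.0), (50.0, 60.0)), ((30.0, 40.0), (200.0, 220.0))]
boxes = [(40.0, 40.0, 100.0, 100.0)]
kept = box_filter(matches, boxes)
```

In this reading the detector's output directly constrains the matcher's output, which is the kind of task-collaborative dependency the abstract describes.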

Paper Structure (32 sections, 10 equations, 7 figures, 6 tables)

Introduction
Related Work
Problem Definition
Methodology
MDBase Network
MatchDet Network
Weighted Attention Module
Weighted Spatial Attention Module
Match Head with Box Filter
Loss Function
Discussion
Task-collaborative vs. Task-individual
WSAM vs. WAM
Weighted Attention vs. Masked Attention
Experiments
...and 17 more sections

Figures (7)

Figure 1: (a) Our MatchDet with collaborative learning for improving image matching and object detection. We introduce a baseline named MDBase network, which removes the collaborative learning module of MatchDet. (b) The object Tracker with correlation-aggregation learning. The dashed line indicates that the Tracker has the potential to obtain pairwise correspondences, although there is no matching objective function to supervise it. (c) and (d) are the results on the Warp-COCO dataset. (c) Our MatchDet obtains a 4.06% improvement in object detection. (d) Our MatchDet achieves 24.24% higher performance in image matching.
Figure 2: The network architecture of our MatchDet. There are four stages: ① Obtaining basic features $\{{C^t_3},{C^r_3}\}$ with a shared backbone. ② Matcher branch estimates the homography matrix with the enhanced features $\{\bar{C}^t_3,\bar{C}^r_3\}$ produced by Weighted Attention Module. ③ Detector branch predicts the bounding boxes based on the highlighted features ${C^t_3}'$ generated by Weighted Spatial Attention Module. ④ Box Filter refines the image matching results via filtering out the potential mismatches.
Figure 3: (a) The Weighted Attention Module (WAM) consists of a Weighted Attention block and a Self-Attention block, where $\{Q,K,V\}$ are known as $\{\emph{query}, \emph{key}, \emph{value}\}$ and FFN denotes the Feed-Forward Network in Transformer. (b) The Weighted Attention applied in WAM, where ${\odot}$ is the Broadcasting Element-wise Product. The variable dimensions are ${\{V^{\sim Q},Q,K,V\} \in \mathbb{R}^{hw \times c }}$ and ${\{M_Q,M_K\} \in \mathbb{R}^{hw}}$. (c) The Weighted Spatial Attention enhances the spatial response of $Q$ by ${M_{QV} \in \mathbb{R}^{hw}}$ to obtain ${Q' \in \mathbb{R}^{hw \times c }}$, where $\langle \cdot \rangle$ calculates the cosine similarity. Replacing the Weighted Attention in WAM with the Weighted Spatial Attention yields the Weighted Spatial Attention Module (WSAM).
Figure 4: The visualizations of the generated Weighted Map.
Figure 5: The visualizations for WAM, WSAM, Box Filter and MatchDet results under the GTBoxR setting from miniScanNet. (a) - (e) show the results before and after processing by the corresponding modules. (e) shows the predicted bounding boxes and matching results of MatchDet, where these matches are obtained after the Box Filter. (f) is the Ground-Truth.
...and 2 more figures
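The Figure 3(c) caption describes a spatial weight $M_{QV} \in \mathbb{R}^{hw}$ built from cosine similarity and applied to $Q$ by a broadcast product. The sketch below illustrates that idea in plain Python. It assumes that the similarity is taken per spatial position between $Q$ and the aligned value $V^{\sim Q}$, and that $Q'$ is simply the weighted $Q$ with no residual or clamping; the caption does not confirm either choice, and the function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]  # shape (hw, c): one feature vector per spatial position


def cosine(u: List[float], v: List[float], eps: float = 1e-8) -> float:
    """Cosine similarity between two vectors, with eps to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)


def weighted_spatial_attention(q: Matrix, v_aligned: Matrix) -> Matrix:
    """Assumed form: M_QV[i] = cos(Q[i], V~Q[i]); Q'[i] = M_QV[i] * Q[i]."""
    m_qv = [cosine(qi, vi) for qi, vi in zip(q, v_aligned)]  # shape (hw,)
    # Broadcast each per-position weight across the c channels of that position.
    return [[w * x for x in qi] for qi, w in zip(q, m_qv)]


# Position 0 agrees with the aligned value; position 1 is orthogonal to it.
q = [[1.0, 0.0], [0.0, 2.0]]
v_aligned = [[2.0, 0.0], [3.0, 0.0]]
q_prime = weighted_spatial_attention(q, v_aligned)
```

Positions where the two feature maps agree keep their response, while positions that disagree are suppressed. Note that a raw cosine weight can be negative; how the paper handles that case is not stated in the caption.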
