Learning Accurate Template Matching with Differentiable Coarse-to-Fine Correspondence Refinement

Zhirui Gao; Renjiao Yi; Zheng Qin; Yunfan Ye; Chenyang Zhu; Kai Xu

TL;DR

This work tackles robust, pixel-precise template matching under cross-modal and cluttered industrial conditions with a differentiable coarse-to-fine correspondence refinement pipeline. An edge-aware translation module converts template masks and grayscale images to a common edge domain, while transformers encode and fuse multi-scale features to establish high-quality correspondences without RANSAC. Coarse matching via differentiable optimal transport with spatial-consistency weights yields a reliable initial homography $H_c$, which a fine-level network refines to sub-pixel accuracy, giving the final homography $H$. The approach demonstrates state-of-the-art accuracy on industrial datasets (Mechanical Parts, Assembly Holes) and generalizes to unseen real data; two new datasets and synthetic data generation support robust training. Practical impact is shown on industrial lines, where the method enables precise pose estimation for robotic grasping and part inspection at competitive runtime.
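To make "differentiable optimal transport" concrete, here is a minimal Sinkhorn-style sketch in plain Python. It is an illustration under stated assumptions, not the paper's implementation: the function names `sinkhorn` and `mutual_best_matches` are hypothetical, the score matrix is assumed square with uniform marginals, and the spatial-consistency weights mentioned above are not modelled.

```python
import math

def sinkhorn(scores, n_iters=50):
    """Alternately normalise the rows and columns of exp(scores).

    `scores` is a list of rows of similarity values (higher = better match).
    Assumption: the matrix is square, so both marginals can be uniform and
    the result approaches a doubly stochastic assignment matrix.
    """
    n, m = len(scores), len(scores[0])
    # Subtract the global maximum before exponentiating for numerical stability.
    top = max(max(row) for row in scores)
    P = [[math.exp(s - top) for s in row] for row in scores]
    for _ in range(n_iters):
        # Row step: every template feature distributes one unit of mass.
        P = [[v / sum(row) for v in row] for row in P]
        # Column step: every image feature receives one unit of mass.
        col_sums = [sum(P[i][j] for i in range(n)) for j in range(m)]
        P = [[P[i][j] / col_sums[j] for j in range(m)] for i in range(n)]
    return P

def mutual_best_matches(P):
    """Keep pairs (i, j) that are each other's highest-probability partner."""
    n, m = len(P), len(P[0])
    row_best = [max(range(m), key=lambda j: P[i][j]) for i in range(n)]
    col_best = [max(range(n), key=lambda i: P[i][j]) for j in range(m)]
    return [(i, j) for i, j in enumerate(row_best) if col_best[j] == i]

# Toy similarity scores between 3 template features and 3 image features.
scores = [[2.0, 0.5, 0.0],
          [0.0, 2.0, 1.0],
          [1.0, 0.0, 2.0]]
P = sinkhorn(scores)
matches = mutual_best_matches(P)
```

Every step is a division or an exponential, so gradients can flow through the normalisation when the same operations are written in an autodiff framework; that is the sense in which such a matching layer is differentiable.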

Abstract

Template matching is a fundamental task in computer vision and has been studied for decades. It plays an essential role in the manufacturing industry for estimating the poses of different parts, facilitating downstream tasks such as robotic grasping. Existing methods fail when the template and source images have different modalities, cluttered backgrounds or weak textures. They also rarely consider geometric transformations via homographies, which commonly exist even for planar industrial parts. To tackle these challenges, we propose an accurate template matching method based on differentiable coarse-to-fine correspondence refinement. We use an edge-aware module to overcome the domain gap between the mask template and the grayscale image, allowing robust matching. An initial warp is estimated using coarse correspondences based on novel structure-aware information provided by transformers. This initial alignment is passed to a refinement network, which uses the reference and aligned images to obtain sub-pixel level correspondences; these give the final geometric transformation. Extensive evaluation shows that our method is significantly better than state-of-the-art methods and baselines, providing good generalization ability and visually plausible results even on unseen real data.
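Since the abstract describes the output as a homography, a small sketch of how such a transformation acts on points may help. The helper names `warp_point` and `inlier_ratio` are hypothetical, and the 3-pixel default simply mirrors the reprojection threshold quoted in the Figure 5 caption for the industrial datasets; the paper's own evaluation code is not reproduced here.

```python
import math

def warp_point(H, x, y):
    """Apply a 3x3 homography H (nested lists, row-major) to the point (x, y)."""
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    # Divide by the homogeneous coordinate to return to pixel coordinates.
    return u / w, v / w

def inlier_ratio(H, matches, threshold_px=3.0):
    """Fraction of matches ((x, y), (x2, y2)) with reprojection error <= threshold_px."""
    if not matches:
        return 0.0
    inliers = 0
    for (x, y), (x2, y2) in matches:
        u, v = warp_point(H, x, y)
        if math.hypot(u - x2, v - y2) <= threshold_px:
            inliers += 1
    return inliers / len(matches)

# A pure translation by (10, -5), written as a homography.
H = [[1.0, 0.0, 10.0],
     [0.0, 1.0, -5.0],
     [0.0, 0.0, 1.0]]
matches = [((0.0, 0.0), (10.0, -5.0)),   # consistent with H
           ((1.0, 1.0), (11.0, -4.0)),   # consistent with H
           ((2.0, 2.0), (20.0, 20.0))]   # gross outlier
ratio = inlier_ratio(H, matches)
```

Because a homography is defined only up to scale, multiplying every entry of `H` by the same non-zero constant leaves `warp_point` unchanged.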

Paper Structure (53 sections, 14 equations, 12 figures, 6 tables)

This paper contains 53 sections, 14 equations, 12 figures, 6 tables.

Introduction
Related Work
Template Matching
Homography Estimation
Feature Matching
Vision Transformers
Overview
Method
Task
Feature Extraction and Aggregations
Edge translation
Feature extraction
Feature aggregation with transformers
Coarse Matching
Establishing coarse matches
...and 38 more sections

Figures (12)

Figure 1: Our template matching method. (a) Template $T$ and image $I$. (b) Coarse matching. (c) Matching refinement. (d) Template warped to the image using the estimated geometric transformation.
Figure 2: Pipeline: the proposed method has five steps. (1) Translation module: convert the source image $I$ and template mask $T$ into edge maps (Sec. \ref{sec:edge}). (2) Feature extraction: extract coarse-level feature maps and fine-level feature maps (Sec. \ref{sec:extraction}). (3) Coarse matching: two sets of coarse-level features are aggregated by interleaving self and cross attention layers to provide the initial homography transformation $H_c$ (Sec. \ref{sec:coarselevel}). (4) Fine-level matching: global and local features are fused to give the set of sub-pixel level matches $\mathcal{M}_f$ (Sec. \ref{sec:finelevelmatching}). (5) Homography estimation (Sec. \ref{sec:homography estimation}).
Figure 3: Architecture of encoder and attention layers. Left: encoder. Right: quadratic ($O(N^2)$ complexity) attention layer and linear ($O(N)$ complexity) attention layer.
Figure 4: Given two matching pairs $a=(i,i')$ and $b=(j,j')$, we calculate both their distance compatibility and their angular compatibility. Green nodes represent $k$-nearest neighbors.
Figure 5: Qualitative matching results for the three test datasets. Compared to SuperGlue, COTR and LoFTR, our method consistently obtains a higher inlier ratio, successfully coping with large viewpoint change, small objects and non-rigid deformation. Red indicates a reprojection error beyond 3 pixels for the Mechanical Parts and Assembly Holes datasets and 5 pixels for the COCO dataset. Further qualitative results can be found in the declarations \ref{Declarations}.
...and 7 more figures
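
The distance compatibility between two matching pairs mentioned in the Figure 4 caption can be illustrated with a length-consistency score. The captions do not give the paper's formula, so the form below is an assumption borrowed from common spatial-consistency measures: the function name `distance_compatibility`, the tolerance `sigma`, and the clamped quadratic falloff are all hypothetical, and the angular term is omitted.

```python
import math

def distance_compatibility(p_i, p_j, q_i, q_j, sigma=4.0):
    """Score in [0, 1] for how well matches a=(i, i') and b=(j, j') agree.

    p_i, p_j are the template-side points of the two matches and q_i, q_j
    their image-side partners. Assumption: consistent matches preserve the
    distance between the two points up to a tolerance `sigma` (in pixels),
    which holds exactly only for rigid motion without scale change.
    """
    d_template = math.dist(p_i, p_j)
    d_image = math.dist(q_i, q_j)
    diff = abs(d_template - d_image)
    # Quadratic falloff, clamped at zero once the difference exceeds sigma.
    return max(0.0, 1.0 - (diff / sigma) ** 2)

# Both pairs moved by the same translation: lengths agree, score is maximal.
same = distance_compatibility((0, 0), (3, 4), (10, 10), (13, 14))
# The second match lands far away: lengths disagree strongly, score is zero.
bad = distance_compatibility((0, 0), (3, 4), (0, 0), (30, 40))
```

Under a general homography the inter-point distance is not preserved, so a score of this kind is only a soft cue; pairing it with an angular term, as the caption describes, is one way to make the check less sensitive to scale.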
