An Efficient Deep Template Matching and In-Plane Pose Estimation Method via Template-Aware Dynamic Convolution

Ke Jia, Ji Zhou, Hanxin Li, Zhigan Zhou, Haojie Chu, Xiaojie Li

TL;DR

This work proposes a lightweight end-to-end framework that reformulates template matching as joint localization and geometric regression, outputting the center coordinates, rotation angle, and independent horizontal and vertical scales, and introduces a rotation-shear-based augmentation strategy with structure-aware pseudo labels.

Abstract

In industrial inspection and component alignment tasks, template matching requires efficient estimation of a target's position and geometric state (rotation and scaling) under complex backgrounds to support precise downstream operations. Traditional methods rely on exhaustive enumeration of angles and scales, leading to low efficiency under compound transformations. Meanwhile, most deep learning-based approaches only estimate similarity scores without explicitly modeling geometric pose, making them inadequate for real-world deployment. To overcome these limitations, we propose a lightweight end-to-end framework that reformulates template matching as joint localization and geometric regression, outputting the center coordinates, rotation angle, and independent horizontal and vertical scales. A Template-Aware Dynamic Convolution Module (TDCM) dynamically injects template features at inference to guide generalizable matching. The compact network integrates depthwise separable convolutions and pixel shuffle for efficient matching. To enable geometric-annotation-free training, we introduce a rotation-shear-based augmentation strategy with structure-aware pseudo labels. A lightweight refinement module further improves angle and scale precision via local optimization. Experiments show that our 3.07M-parameter model achieves high precision and 14 ms inference under compound transformations. It also demonstrates strong robustness in small-template and multi-object scenarios, making it well suited to deployment in real-time industrial applications. The code is available at: https://github.com/ZhouJ6610/PoseMatch-TDCM.
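To make the regressed outputs concrete, the following minimal sketch shows how a center, a rotation angle, and independent horizontal and vertical scales can be turned into the four corners of the matched template region. The function name `pose_to_corners`, the scale-then-rotate order, the angle in degrees, and the sign convention of the rotation are assumptions made for illustration; the abstract does not specify them, and the released code may differ.

```python
import math

def pose_to_corners(cx, cy, theta_deg, sx, sy, tw, th):
    """Corners of a tw x th template placed at center (cx, cy).

    Assumed convention (not taken from the paper): scale the template by
    (sx, sy) first, then rotate by theta_deg about the center using the
    standard 2D rotation matrix in image coordinates.
    """
    t = math.radians(theta_deg)
    c, s = math.cos(t), math.sin(t)
    # Half extents of the template after anisotropic scaling.
    hw, hh = 0.5 * tw * sx, 0.5 * th * sy
    corners = []
    for dx, dy in ((-hw, -hh), (hw, -hh), (hw, hh), (-hw, hh)):
        # Rotate the offset, then translate to the predicted center.
        corners.append((cx + c * dx - s * dy, cy + s * dx + c * dy))
    return corners

if __name__ == "__main__":
    # A 20 x 10 template, stretched 2x horizontally and rotated by 30 degrees.
    for x, y in pose_to_corners(100.0, 50.0, 30.0, 2.0, 1.0, 20, 10):
        print(f"({x:.2f}, {y:.2f})")
```

Because the two scales are independent, a single pose can describe aspect-ratio changes that a uniform-scale search over a grid of angles and sizes would have to enumerate separately.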

Paper Structure

This paper contains 35 sections, 12 equations, 8 figures, 7 tables, 2 algorithms.

Figures (8)

  • Figure 1: Overview of the proposed framework. Shallow features are extracted from template and search images; template features are encoded as dynamic kernels and applied to search features, then decoded into response and parameter maps, followed by lightweight refinement for accurate pose estimation.
  • Figure 2: Architecture of the Template-Aware Dynamic Convolution Module (TDCM). The template features are encoded as a dynamic convolution kernel and applied to the shallow search features via depthwise separable convolution. This enables structure-aligned feature fusion and facilitates pose-aware representation learning.
  • Figure 3: Representative matching results under various transformation settings.
  • Figure 4: Performance comparison in multi-target matching scenarios. Our method achieves the highest precision and recall across all transformation levels and surpasses SHM in precision and mIoU under mild and moderate transformations (S1–S1.5). It also offers better inference efficiency in compound transformation scenarios.
  • Figure 5: Typical failure and challenging cases. Each subfigure shows the input template, the corresponding search image, and the predicted results. The red bounding box denotes the ground-truth position, while the green bounding box indicates the predicted location. $GT_{score}$ refers to the ground-truth center heatmap generated via affine transformation. $pred_{score}$ denotes the predicted center heatmap. (a) Normal; (b) tiny template; (c) severe aspect ratio distortion; (d–e) highly similar distractors; (f) low-texture template; (g–h) background clutter with similar patterns; (i–j) multi-object suppression.
  • ...and 3 more figures
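
The idea in the Figure 2 caption, template features acting as a convolution kernel over search features, can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function names `depthwise_correlate` and `pointwise_mix`, the nested-list feature layout, and the use of raw template patches as kernels are hypothetical. The actual TDCM encodes the template with learned layers that are not described here.

```python
def depthwise_correlate(search, kernels):
    """Per-channel valid cross-correlation.

    search:  C x H x W nested lists (search-image features).
    kernels: C x kh x kw nested lists (one template-derived kernel per channel).
    """
    out = []
    for feat, ker in zip(search, kernels):
        kh, kw = len(ker), len(ker[0])
        h, w = len(feat), len(feat[0])
        out.append([
            [sum(feat[y + i][x + j] * ker[i][j]
                 for i in range(kh) for j in range(kw))
             for x in range(w - kw + 1)]
            for y in range(h - kh + 1)
        ])
    return out

def pointwise_mix(channels, weights):
    """1x1 convolution: weights is out_C x in_C and mixes channels per pixel."""
    h, w = len(channels[0]), len(channels[0][0])
    return [
        [[sum(wt * ch[y][x] for wt, ch in zip(row, channels)) for x in range(w)]
         for y in range(h)]
        for row in weights
    ]

if __name__ == "__main__":
    # One 4 x 4 feature map with a 2 x 2 pattern at rows 1-2, columns 2-3,
    # duplicated over two channels.
    feat = [[0, 0, 0, 0],
            [0, 0, 1, 2],
            [0, 0, 3, 4],
            [0, 0, 0, 0]]
    search = [feat, feat]
    # Hypothetical template-derived kernels, one per channel.
    kernels = [[[1, 2], [3, 4]], [[1, 1], [1, 1]]]
    fused = pointwise_mix(depthwise_correlate(search, kernels), [[1.0, 1.0]])
    for row in fused[0]:
        print(row)
```

The depthwise step keeps the cost proportional to the number of channels, and the 1x1 step mixes the per-channel responses, which is the usual split in a depthwise separable convolution.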