Explaining Away Results in Accurate and Tolerant Template Matching

M. W. Spratling

TL;DR

This paper introduces a template matching approach that uses explaining away, implemented with Divisive Input Modulation (DIM), so that templates compete to account for the image evidence and produce sparse, robust matches. By pre-processing images to emphasize relative intensity and formulating matching as competitive, iterative inference, the method achieves greater tolerance to appearance changes than traditional template matching and several recent alternatives. Across multiple benchmarks, including the Best Buddies and Oxford VGG datasets, DIM with additional non-target templates consistently outperforms the baselines in accuracy, while remaining adaptable through its parameter settings. The work demonstrates practical gains for patch localization and correspondence, and outlines avenues for extending tolerance to viewpoint changes and to CNN-based feature spaces.

Abstract

Recognising and locating image patches or sets of image features is an important task underlying much work in computer vision. Traditionally this has been accomplished using template matching. However, template matching is notoriously brittle in the face of changes in appearance caused by, for example, variations in viewpoint, partial occlusion, and non-rigid deformations. This article tests a method of template matching that is more tolerant to such changes in appearance and that can, therefore, more accurately identify image patches. In traditional template matching the comparison between a template and the image is independent of the other templates. In contrast, the method advocated here takes into account the evidence provided by the image for the template at each location and the full range of alternative explanations represented by the same template at other locations and by other templates. Specifically, the proposed method of template matching is performed using a form of probabilistic inference known as "explaining away". The algorithm used to implement explaining away has previously been used to simulate several neurobiological mechanisms, and been applied to image contour detection and pattern recognition tasks. Here it is applied for the first time to image patch matching, and is shown to produce superior results in comparison to the current state-of-the-art methods.
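
For contrast with the proposed method, the traditional baseline described in the abstract can be sketched as zero-mean normalised cross-correlation (ZNCC), in which every template location is scored independently of all others. This is a minimal illustration assuming single-channel images stored as NumPy arrays. The function names `zncc_map` and `best_match` are hypothetical and are not taken from the paper, and this is not the DIM algorithm itself.

```python
import numpy as np

def zncc_map(image, template):
    """Score a template at every valid location using ZNCC.

    Each location is scored on its own: no other location or template
    influences the result, which is the independence the abstract
    attributes to traditional template matching.
    """
    image = np.asarray(image, dtype=float)
    template = np.asarray(template, dtype=float)
    th, tw = template.shape
    ih, iw = image.shape
    t = template - template.mean()
    t_norm = np.sqrt((t ** 2).sum())
    out = np.zeros((ih - th + 1, iw - tw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            patch = image[r:r + th, c:c + tw]
            p = patch - patch.mean()
            denom = t_norm * np.sqrt((p ** 2).sum())
            # A flat patch or template has no defined correlation; score it 0.
            out[r, c] = (t * p).sum() / denom if denom > 0 else 0.0
    return out

def best_match(image, template):
    """Return (row, col, score) of the top-left corner with the highest ZNCC."""
    scores = zncc_map(image, template)
    r, c = np.unravel_index(np.argmax(scores), scores.shape)
    return int(r), int(c), float(scores[r, c])

# Usage: cut a template out of a random image and find it again.
rng = np.random.default_rng(0)
img = rng.random((20, 20))
tmpl = img[5:11, 7:13].copy()
row, col, score = best_match(img, tmpl)
```

Because each score depends only on one patch and one template, a location that resembles the target only partially can still score highly. The explaining-away approach is intended to suppress such competing matches.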

Paper Structure

This paper contains 17 sections, 3 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Example results for different algorithms when applied to the task of finding corresponding locations in 105 pairs of colour video frames. Images in the first row show the target templates (outlined in yellow) in the initial frame of the video. Images in the second row show the location of the target identified by the DIM algorithm (outlined in cyan) and the location of the target defined by the ground-truth data (outlined in yellow) in a later frame of the same video. The third to seventh rows show the similarity of the target template to the second image as determined by (from row 3 to 7): ZNCC, BBS, DDIS, DIM with no additional templates, and DIM with up to four additional templates chosen by maximum correlation. Darker pixels correspond to stronger similarity. Note, matching was performed using colour templates and colour images, but for clarity the images are shown in grayscale in rows 1 and 2.
  • Figure 2: The performance of different algorithms when applied to the task of finding corresponding locations in 105 pairs of colour video frames. Each curve shows the fraction of targets for which the overlap between the ground-truth and predicted bounding-boxes exceeded the threshold indicated on the x-axis. (a) Results when using the target location predicted by the maximum similarity. (b) Results when using the maximum overlap predicted by the seven highest similarity values. The results for DIM are produced using up to four additional templates chosen by maximum correlation.
  • Figure 3: The effect of the maximum number of additional templates, and their selection method, on the performance of the proposed method, DIM, when applied to finding corresponding locations in 105 pairs of colour video frames. Results are shown when the additional templates were selected from (a) the first image in each pair, and (b) an unrelated image.
  • Figure 4: The performance of different algorithms when applied to the task of finding corresponding locations across image sequences from the Oxford VGG affine covariant features dataset (at half size with 25 templates per image pair). Each curve shows the fraction of targets for which the overlap between the ground-truth and predicted bounding-boxes exceeded the threshold indicated on the x-axis. Results for different image sequences are shown using different line styles, as indicated in the key. Each row shows results for a different algorithm (from top to bottom): ZNCC, BBS, DDIS, and DIM. Each column shows results for a different template size (from left to right): 17-by-17, 33-by-33, and 49-by-49 pixels.
  • Figure 5: The performance of different algorithms when applied to performing template matching in colour images from the Oxford VGG affine covariant features dataset (at half size). Each curve shows the trade-off between precision and recall for different thresholds applied to the similarity values. A match was considered correct if the bounding box overlap between the predicted location and the true location was at least 0.5. Results are shown for three different sizes of template (a) 17-by-17 pixels, (b) 33-by-33 pixels, and (c) 49-by-49 pixels.
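
The evaluation described in the captions of Figures 2, 4 and 5 can be sketched as follows. The captions do not define "overlap", so this sketch assumes it is the intersection-over-union of axis-aligned boxes given as (x, y, width, height). The function names are hypothetical and the numbers in the usage lines are made-up illustrations, not results from the paper.

```python
import numpy as np

def box_overlap(a, b):
    """Intersection-over-union of two boxes given as (x, y, width, height).

    Assumption: the captions do not define "overlap"; IoU is used here.
    """
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def success_curve(overlaps, thresholds):
    """Fraction of targets whose overlap exceeds each threshold (Figures 2 and 4)."""
    overlaps = np.asarray(overlaps, dtype=float)
    return np.array([(overlaps > t).mean() for t in thresholds])

def best_of_k_overlap(candidates, truth):
    """Largest overlap among k candidate boxes, as in Figure 2(b) with k = 7."""
    return max(box_overlap(c, truth) for c in candidates)

# Usage with illustrative (made-up) boxes and overlaps.
truth = (10, 10, 20, 20)
predicted = (15, 10, 20, 20)
single = box_overlap(predicted, truth)
top_k = best_of_k_overlap([(40, 40, 20, 20), predicted, truth], truth)
curve = success_curve([0.2, 0.6, 0.9], thresholds=[0.0, 0.5, 0.8])
```

Figure 5 uses the same overlap measure, but there a match counts as correct when the overlap is at least 0.5, and the threshold that is varied is applied to the similarity values rather than to the overlap.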