Normalized Matching Transformer

Abtin Pourhadi, Paul Swoboda

TL;DR

Normalized Matching Transformer tackles sparse keypoint matching with a fully differentiable pipeline that omits combinatorial solvers. It combines a Swin Transformer backbone, a geometry-aware SplineCNN graph neural network, and a two-stream normalized transformer decoder whose affinities are refined by Sinkhorn decoding. Training uses InfoNCE and hyperspherical losses together with data augmentation, yielding higher accuracy on PascalVOC and SPair-71k while requiring fewer epochs to converge. The approach suggests that pervasive normalization across architecture and losses can significantly improve training stability and matching accuracy in keypoint correspondence tasks.

Abstract

We present a new state-of-the-art approach for sparse keypoint matching between pairs of images. Our method is fully deep-learning-based: it combines a visual backbone with a SplineCNN graph neural network for feature processing, and a normalized transformer decoder that, together with the Sinkhorn algorithm, decodes keypoint correspondences. The model is trained with a contrastive loss and a hyperspherical loss for better feature representations, and we additionally use data augmentation during training. This comparatively simple architecture, combining extensive normalization and advanced losses, outperforms the current state-of-the-art approaches BBGM, ASAR, COMMON and GMTR on the PascalVOC and SPair-71k datasets by $5.1\%$ and $2.2\%$, respectively, while training for at least $1.7\times$ fewer epochs.
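To make the decoding step concrete, the following is a minimal sketch of Sinkhorn normalization in log space applied to cosine-similarity affinities. It is not the authors' implementation: the function names, the temperature `tau`, the iteration count, the assumption of equally many keypoints in both images, and the final row-wise argmax decoding are all illustrative assumptions.

```python
import math

def l2_normalize(v):
    """Scale a feature vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_affinities(feats_a, feats_b):
    """Cosine similarity between every keypoint feature of image A and image B."""
    a = [l2_normalize(f) for f in feats_a]
    b = [l2_normalize(f) for f in feats_b]
    return [[sum(x * y for x, y in zip(fa, fb)) for fb in b] for fa in a]

def logsumexp(values):
    """Numerically stable log(sum(exp(values)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def log_sinkhorn(affinity, tau=0.05, n_iters=50):
    """Alternate row and column normalization in log space.

    tau and n_iters are assumed values, not taken from the paper.
    Returns the log of an approximately doubly stochastic matrix.
    """
    log_p = [[s / tau for s in row] for row in affinity]
    for _ in range(n_iters):
        # Row step: make every row sum to one (in probability space).
        row_lse = [logsumexp(row) for row in log_p]
        log_p = [[v - row_lse[i] for v in row] for i, row in enumerate(log_p)]
        # Column step: make every column sum to one.
        col_lse = [logsumexp(col) for col in zip(*log_p)]
        log_p = [[v - col_lse[j] for j, v in enumerate(row)] for row in log_p]
    return log_p

def decode_matching(log_p):
    """Assumed decoding rule: pick the most likely partner for each row."""
    return [max(range(len(row)), key=row.__getitem__) for row in log_p]

# Toy example: image B holds slightly perturbed, permuted copies of A's features.
feats_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
feats_b = [[0.0, 0.9, 0.1], [0.1, 0.0, 0.9], [0.9, 0.1, 0.0]]
log_p = log_sinkhorn(cosine_affinities(feats_a, feats_b))
print(decode_matching(log_p))  # -> [2, 0, 1]
```

Working in log space avoids overflow when the affinities are divided by a small temperature, which is presumably why a log-space routine is used.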

Paper Structure

This paper contains 22 sections, 4 equations, 4 figures, 4 tables, 4 algorithms.

Figures (4)

  • Figure 1: Normalized matching transformer inference. Each image of a pair is passed through a Swin Transformer visual backbone. Features at the keypoints are extracted and passed to a SplineCNN for further refinement. A normalized transformer decoder interleaves self-attention among keypoint features from the same image with cross-attention that mixes information across images. Finally, cosine similarities are computed and given as affinities to a log-space Sinkhorn routine, from which a matching is decoded.
  • Figure 2: Normalized matching transformer losses. Losses are applied to the features computed by the normalized transformer decoder. To align matching correspondences, InfoNCE losses are computed on the cosine similarities between the feature of a single keypoint in one image and all keypoint features of the other image. For symmetry, we apply the InfoNCE loss in both directions. To spread apart the features of different keypoints in the same image, we apply a hyperspherical loss to the keypoint features of each image separately.
  • Figure 3: Geometric illustration of hyperspherical and InfoNCE losses. The hyperspherical loss (two left spheres) from \ref{eq:prot_loss_s} is applied to each image $i \in [2]$ separately and distributes the features $f^i_j$ of the different keypoints $j \in [m]$ across the hypersphere. The InfoNCE loss (right side) from \ref{eq:infoNCE} aligns the features $f^1_j \Leftrightarrow f^2_j$ of matching keypoints from different images (assuming here that the matching is the identity).
  • Figure 4: Qualitative results of selected keypoint matchings from the SPair-71k dataset \cite{min2019spair}. The top row depicts perfect matchings, while the bottom row shows a few failure cases.
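
The two losses described in the captions of Figures 2 and 3 can be sketched as follows. This is a hedged illustration only: the equations themselves are not reproduced in this section, so the hyperspherical term below uses an assumed form (the mean, over keypoints, of the largest cosine similarity to any other keypoint in the same image), and the temperature `tau`, the equal weighting of the two InfoNCE directions, and all function names are hypothetical. As in Figure 3, keypoint $j$ in the first image is assumed to match keypoint $j$ in the second.

```python
import math

def l2_normalize(v):
    """Scale a feature vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_matrix(feats_a, feats_b):
    """Pairwise cosine similarities between two sets of keypoint features."""
    a = [l2_normalize(f) for f in feats_a]
    b = [l2_normalize(f) for f in feats_b]
    return [[sum(x * y for x, y in zip(fa, fb)) for fb in b] for fa in a]

def logsumexp(values):
    """Numerically stable log(sum(exp(values)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def info_nce(sim, tau=0.1):
    """One-directional InfoNCE: row j should score highest at column j."""
    total = 0.0
    for j, row in enumerate(sim):
        logits = [s / tau for s in row]
        total += logsumexp(logits) - logits[j]  # -log softmax(logits)[j]
    return total / len(sim)

def symmetric_info_nce(feats_1, feats_2, tau=0.1):
    """Average of the image 1 -> 2 and image 2 -> 1 InfoNCE terms."""
    sim = cosine_matrix(feats_1, feats_2)
    sim_t = [list(col) for col in zip(*sim)]
    return 0.5 * (info_nce(sim, tau) + info_nce(sim_t, tau))

def hyperspherical_spread(feats):
    """Assumed spreading term for one image: penalize each keypoint's
    most similar other keypoint, which pushes features apart on the sphere."""
    sim = cosine_matrix(feats, feats)
    m = len(feats)
    worst = [max(sim[j][k] for k in range(m) if k != j) for j in range(m)]
    return sum(worst) / m

# Toy check: aligned features give a small InfoNCE loss, shuffled ones a large loss.
f1 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shuffled = [f1[1], f1[2], f1[0]]
print(symmetric_info_nce(f1, f1) < symmetric_info_nce(f1, shuffled))  # -> True
```

In this sketch the two terms pull in complementary directions: InfoNCE draws matching features from different images together, while the spreading term keeps the features of distinct keypoints within one image from collapsing onto each other.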