Normalized Matching Transformer
Abtin Pourhadi, Paul Swoboda
TL;DR
Normalized Matching Transformer tackles sparse keypoint matching with a fully differentiable pipeline that omits combinatorial solvers. It combines a Swin Transformer backbone, a geometry-aware SplineCNN graph neural network, and a two-stream normalized transformer decoder whose affinities are refined by Sinkhorn decoding. Training uses InfoNCE and hyperspherical losses together with data augmentation, yielding state-of-the-art accuracy on PascalVOC and SPair-71k while requiring fewer epochs to converge. The approach demonstrates that pervasive normalization across architecture and losses can significantly improve training stability and matching accuracy in keypoint correspondence tasks.
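To make the Sinkhorn decoding step concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes that the affinity between two keypoints is the cosine similarity of their L2-normalized features and that a temperature `tau` sharpens the scores. The names `match_keypoints`, `sinkhorn`, `tau` and `n_iters` are hypothetical and chosen for illustration only.

```python
import numpy as np


def _logsumexp(a, axis):
    # Numerically stable log(sum(exp(a))) along one axis, keeping dimensions.
    m = a.max(axis=axis, keepdims=True)
    return m + np.log(np.exp(a - m).sum(axis=axis, keepdims=True))


def sinkhorn(scores, tau=0.05, n_iters=50):
    # Log-domain Sinkhorn: alternately normalize rows and columns so that
    # the result approaches a doubly stochastic matrix (square input assumed).
    log_p = scores / tau
    for _ in range(n_iters):
        log_p = log_p - _logsumexp(log_p, axis=1)  # rows sum to 1
        log_p = log_p - _logsumexp(log_p, axis=0)  # columns sum to 1
    return np.exp(log_p)


def match_keypoints(src_feats, tgt_feats, tau=0.05, n_iters=50):
    # Assumption: affinities are cosine similarities of unit-norm features.
    src = src_feats / np.linalg.norm(src_feats, axis=1, keepdims=True)
    tgt = tgt_feats / np.linalg.norm(tgt_feats, axis=1, keepdims=True)
    return sinkhorn(src @ tgt.T, tau=tau, n_iters=n_iters)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(6, 32))
    perm = rng.permutation(6)
    soft_assignment = match_keypoints(feats, feats[perm])
    # Row-wise argmax gives the predicted target index for each source keypoint.
    print(soft_assignment.argmax(axis=1))
```

Because every step is a differentiable array operation, gradients can flow through the soft assignment, which is what allows a pipeline of this kind to avoid a combinatorial solver during training.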
Abstract
We present a new state-of-the-art approach for sparse keypoint matching between pairs of images. Our method is fully deep-learning-based: it combines a visual backbone with a SplineCNN graph neural network for feature processing, and a normalized transformer decoder that, together with the Sinkhorn algorithm, decodes keypoint correspondences. Our method is trained using a contrastive and a hyperspherical loss for better feature representations. We additionally use data augmentation during training. This comparatively simple architecture, combining extensive normalization and advanced losses, outperforms current state-of-the-art approaches (BBGM, ASAR, COMMON and GMTR) on the PascalVOC and SPair-71k datasets by $5.1\%$ and $2.2\%$, respectively, while training for at least $1.7\times$ fewer epochs.
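As an illustration of the contrastive loss mentioned above, here is a minimal NumPy sketch of an InfoNCE-style objective over matched keypoint features. It is a generic formulation under stated assumptions, not the paper's exact loss: row `i` of `src_feats` is assumed to correspond to row `i` of `tgt_feats`, all other rows act as negatives, and the function name `info_nce` and the temperature `tau` are hypothetical.

```python
import numpy as np


def info_nce(src_feats, tgt_feats, tau=0.1):
    # Project features onto the unit hypersphere so logits are cosine similarities.
    src = src_feats / np.linalg.norm(src_feats, axis=1, keepdims=True)
    tgt = tgt_feats / np.linalg.norm(tgt_feats, axis=1, keepdims=True)
    logits = src @ tgt.T / tau
    # Stable row-wise log-softmax; the positive for row i is column i.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(8, 16))
    print(info_nce(x, x))                      # correctly paired features
    print(info_nce(x, np.roll(x, 1, axis=0)))  # deliberately mispaired features
```

The loss is low when each source feature is most similar to its own target and grows when the pairing is wrong, which is the behaviour a contrastive matching objective relies on.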
