Generalized Correspondence Matching via Flexible Hierarchical Refinement and Patch Descriptor Distillation
Yu Han, Ziwei Long, Yanting Zhang, Jin Wu, Zhijun Fang, Rui Fan
TL;DR
This work tackles robust, plug-and-play correspondence matching for robotics by generalizing deep feature matching (DFM) through threshold-free hierarchical refinement and a backbone-agnostic patch descriptor. It introduces three core innovations: a flexible nearest-neighbor search that eliminates the fixed refinement threshold, a patch descriptor that makes the method compatible with backbones trained for diverse tasks, and a patch descriptor distillation strategy that reduces computation. Empirical results on HPatches, MegaDepth, and ScanNet show state-of-the-art mean matching accuracy and strong pose and homography performance, particularly in textureless or repetitive regions, while reducing the computational complexity of matching. The approach broadens the practical deployment of learned matching in real-world robotics applications by delivering dense, reliable correspondences across varied backbone architectures.
Abstract
Correspondence matching plays a crucial role in numerous robotics applications. Alongside conventional hand-crafted methods and recent data-driven approaches, plug-and-play algorithms have attracted significant interest: they make full use of pre-trained backbone networks for multi-scale feature extraction and apply hierarchical refinement strategies to generate matched correspondences. This paper addresses the limitations of deep feature matching (DFM), a state-of-the-art (SoTA) plug-and-play correspondence matching approach. First, we eliminate the pre-defined threshold used in DFM's hierarchical refinement process by adopting a more flexible nearest neighbor search strategy, which prevents repetitive yet valid matches from being excluded during the early stages. Second, we integrate a patch descriptor, which extends DFM to a wide range of backbone networks pre-trained on diverse computer vision tasks, including image classification, semantic segmentation, and stereo matching. Third, with practical deployment in real-world robotics applications in mind, we propose a novel patch descriptor distillation strategy that further reduces the computational complexity of correspondence matching. Extensive experiments on three public datasets demonstrate the superior performance of our method. Specifically, on the HPatches dataset it achieves an overall mean matching accuracy of 0.68, 0.92, and 0.95 at tolerances of 1, 3, and 5 pixels, respectively, outperforming all other SoTA algorithms. Our source code, demo video, and supplement are publicly available at mias.group/GCM.
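
For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how mean matching accuracy (MMA) is commonly computed for a single image pair related by a known homography. It is not taken from the released code: the function name `mean_matching_accuracy`, its arguments, and the choice of an inclusive (`<=`) tolerance are assumptions made for illustration, and a full benchmark would typically average this per-pair score over all image pairs.

```python
import numpy as np


def mean_matching_accuracy(pts_a, pts_b, H, thresholds=(1, 3, 5)):
    """Per-pair MMA: fraction of matches whose reprojection error is within
    each pixel tolerance (hypothetical helper, not the authors' code).

    pts_a, pts_b: (N, 2) arrays of matched (x, y) pixel coordinates.
    H: 3x3 ground-truth homography mapping image A onto image B.
    """
    pts_a = np.asarray(pts_a, dtype=float).reshape(-1, 2)
    pts_b = np.asarray(pts_b, dtype=float).reshape(-1, 2)
    H = np.asarray(H, dtype=float)

    # With no matches there is nothing to score; report 0.0 by convention.
    if len(pts_a) == 0:
        return {t: 0.0 for t in thresholds}

    # Warp the points of image A into image B using homogeneous coordinates.
    homog = np.hstack([pts_a, np.ones((len(pts_a), 1))])
    warped = homog @ H.T
    warped = warped[:, :2] / warped[:, 2:3]

    # Euclidean distance between each warped point and its claimed match.
    errors = np.linalg.norm(warped - pts_b, axis=1)

    # Assumption: a match counts as correct when its error is <= tolerance.
    return {t: float(np.mean(errors <= t)) for t in thresholds}


if __name__ == "__main__":
    # Toy example: the ground truth is a pure 10-pixel shift along x.
    H_shift = np.array([[1.0, 0.0, 10.0],
                        [0.0, 1.0, 0.0],
                        [0.0, 0.0, 1.0]])
    a = [[0, 0], [5, 5], [10, 10], [20, 20]]
    b = [[10, 0], [15, 7], [20, 14], [40, 20]]  # errors of 0, 2, 4, 10 px
    print(mean_matching_accuracy(a, b, H_shift))
```

In the toy example the four matches have reprojection errors of 0, 2, 4, and 10 pixels, so one, two, and three of them fall within the 1-, 3-, and 5-pixel tolerances, respectively.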
