Understanding and Optimizing Attention-Based Sparse Matching for Diverse Local Features
Qiang Wang
TL;DR
This work revisits training of attention-based sparse image matchers for diverse local features and identifies removing nearby keypoints as a critical design choice. By decoupling detectors from descriptors, it shows that detectors often drive transformer-based matching performance, while descriptors can be largely transferable with proper training. The authors propose a detector-agnostic fine-tuning strategy that trains existing LightGlue models with correspondences from multiple detectors while keeping descriptor networks fixed, achieving strong zero-shot generalization to unseen detectors across MegaDepth-1500, IMC2021, Aachen Day-Night, and InLoc. The results provide practical guidance for deploying transformer-based matchers, suggest ensemble strategies over multiple detectors, and open avenues for robust, cross-detector matching in real-world localization and mapping pipelines.
Abstract
We revisit the problem of training attention-based sparse image matching models for various local features. We first identify one critical design choice that has been previously overlooked and that significantly impacts the performance of the LightGlue model. We then investigate the role of detectors and descriptors within the transformer-based matching framework, finding that detectors, rather than descriptors, are often the primary cause of performance differences. Finally, we propose a novel approach to fine-tune existing image matching models using keypoints from a diverse set of detectors, resulting in a universal, detector-agnostic model. When deployed as a zero-shot matcher for novel detectors, the resulting model matches or exceeds the accuracy of models specifically trained for those features. Our findings offer valuable insights for the deployment of transformer-based matching models and the future design of local features.
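
The TL;DR names removing nearby keypoints as the critical design choice, but this summary does not spell out the procedure. As a minimal sketch only, one common way to do this is greedy radius-based suppression: visit keypoints in order of decreasing detector score and keep a keypoint only if it is at least some minimum distance from every keypoint already kept. The function name `remove_nearby_keypoints`, the `min_distance` parameter, and the score-ordered greedy rule are illustrative assumptions, not the authors' implementation.

```python
import math


def remove_nearby_keypoints(keypoints, scores, min_distance):
    """Return indices of keypoints kept after greedy radius-based filtering.

    Hypothetical helper for illustration; the paper's exact rule may differ.
    keypoints: list of (x, y) pixel coordinates
    scores: list of detector confidence scores, same length as keypoints
    min_distance: minimum allowed distance (in pixels) between kept keypoints
    """
    # Visit keypoints from highest to lowest score (ties keep input order).
    order = sorted(range(len(keypoints)), key=lambda i: -scores[i])
    kept = []
    for i in order:
        xi, yi = keypoints[i]
        # Keep the keypoint only if it is far enough from every kept one.
        far_enough = all(
            math.hypot(xi - keypoints[j][0], yi - keypoints[j][1]) >= min_distance
            for j in kept
        )
        if far_enough:
            kept.append(i)
    return sorted(kept)


keypoints = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.5)]
scores = [0.9, 0.8, 0.5, 0.7]
kept_indices = remove_nearby_keypoints(keypoints, scores, min_distance=3.0)
```

In this toy input, the two keypoints in each close pair compete and only the higher-scoring one survives. The loop is quadratic in the number of keypoints, which is acceptable for a sketch; a grid or k-d tree would be the usual choice at scale.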
