Understanding and Optimizing Attention-Based Sparse Matching for Diverse Local Features

Qiang Wang

TL;DR

This work revisits training of attention-based sparse image matchers for diverse local features and identifies removing nearby keypoints as a critical design choice. By decoupling detectors from descriptors, it shows that detectors often drive transformer-based matching performance, while descriptors can be largely transferable with proper training. The authors propose a detector-agnostic fine-tuning strategy that trains existing LightGlue models with correspondences from multiple detectors while keeping descriptor networks fixed, achieving strong zero-shot generalization to unseen detectors across MegaDepth-1500, IMC2021, Aachen Day-Night, and InLoc. The results provide practical guidance for deploying transformer-based matchers, suggest ensemble strategies over multiple detectors, and open avenues for robust, cross-detector matching in real-world localization and mapping pipelines.

Abstract

We revisit the problem of training attention-based sparse image matching models for various local features. We first identify one critical design choice that has been previously overlooked, which significantly impacts the performance of the LightGlue model. We then investigate the role of detectors and descriptors within the transformer-based matching framework, finding that detectors, rather than descriptors, are often the primary cause of performance differences. Finally, we propose a novel approach to fine-tune existing image matching models using keypoints from a diverse set of detectors, resulting in a universal, detector-agnostic model. When deployed as a zero-shot matcher for novel detectors, the resulting model achieves or exceeds the accuracy of models specifically trained for those features. Our findings offer valuable insights for the deployment of transformer-based matching models and the future design of local features.
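
The design choice highlighted in the TL;DR, removing nearby keypoints before matching, can be illustrated with a minimal sketch of greedy radius-based non-maximum suppression (NMS). This is only an illustration under stated assumptions: the function name `suppress_nearby_keypoints`, the 4-pixel default radius, and the greedy score-ordered scheme are hypothetical choices, not the authors' implementation or any LightGlue API.

```python
import numpy as np


def suppress_nearby_keypoints(keypoints, scores, radius=4.0):
    """Greedy radius-based NMS over sparse keypoints (illustrative sketch).

    keypoints: (N, 2) array-like of (x, y) pixel coordinates.
    scores:    (N,) array-like of detector confidences.
    radius:    assumed suppression radius in pixels (hypothetical default).

    Returns the sorted indices of the keypoints that survive suppression.
    """
    keypoints = np.asarray(keypoints, dtype=float)
    scores = np.asarray(scores, dtype=float)

    # Visit keypoints from highest to lowest score.
    order = np.argsort(-scores, kind="stable")

    kept = []
    for idx in order:
        if kept:
            # Distance from this candidate to every keypoint already kept.
            dists = np.linalg.norm(keypoints[kept] - keypoints[idx], axis=1)
            if np.any(dists < radius):
                continue  # too close to a stronger keypoint: drop it
        kept.append(int(idx))

    return np.array(sorted(kept), dtype=int)


if __name__ == "__main__":
    kpts = [[10, 10], [11, 10], [50, 50], [52, 51], [100, 100]]
    conf = [0.9, 0.8, 0.3, 0.7, 0.5]
    print(suppress_nearby_keypoints(kpts, conf, radius=4.0))
```

The loop is quadratic in the number of keypoints, which is acceptable for a few thousand points per image; a grid- or max-pooling-based variant on the detector's score map would be the usual choice at larger scale.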

Paper Structure (16 sections, 10 figures, 4 tables)

Figures (10)

  • Figure 1: Existing training recipes for detector-based matching models, such as LightGlue, perform sub-optimally for certain keypoint detectors. We present a simple yet effective training strategy that substantially improves performance for these detectors. In addition, we introduce a zero-shot matching approach that generalizes to novel detectors without retraining, while achieving performance comparable to models trained individually for each detector, measured on the MegaDepth-1500 benchmark under AUC@5°.
  • Figure 2: LightGlue matching results for SiLK with and without NMS. Without NMS, SiLK produces cluttered keypoints, many of which the LightGlue model fails to match.
  • Figure 3: Relative pose estimation results with the official LightGlue models applied to novel detectors on the MegaDepth-1500 dataset. Naively applying pre-trained LightGlue models yields degraded results.
  • Figure 4: Matching novel detectors with off-the-shelf LightGlue models by removing nearby keypoints via NMS or single-scale extraction. One obtains consistent gains with different matchers.
  • Figure 5: Left: The typical training approach for the LightGlue model, tailored to a single detector. Middle: Jointly training the descriptor network and LightGlue model toward a detector-agnostic matcher. Right: Our method of fine-tuning an existing LightGlue model.
  • ...and 5 more figures