OmniGlue: Generalizable Feature Matching with Foundation Model Guidance

Hanwen Jiang, Arjun Karpur, Bingyi Cao, Qixing Huang, Andre Araujo

TL;DR

OmniGlue addresses the limited generalization of learnable image matchers by introducing foundation-model guidance from DINOv2 and a position-disentangled attention mechanism that separates spatial from appearance information during feature propagation. The method builds intra- and inter-image keypoint graphs, prunes cross-image connections using DINOv2 feature similarity, and refines descriptors through attention blocks that use keypoint positions only to compute attention weights, so positional information is not embedded in the final descriptors. Across seven datasets covering scene-level, object-centric and aerial images, OmniGlue improves performance on unseen domains by $20.9\%$ relative to a directly comparable reference model and by $9.5\%$ relative to LightGlue, while maintaining competitive in-domain performance and enabling effective few-shot adaptation. This suggests a practical path toward robust, domain-agnostic image matching for real-world pose estimation and registration tasks.

Abstract

The image matching field has been witnessing a continuous emergence of novel learnable feature matching techniques, with ever-improving performance on conventional benchmarks. However, our investigation shows that despite these gains, their potential for real-world applications is restricted by their limited generalization capabilities to novel image domains. In this paper, we introduce OmniGlue, the first learnable image matcher that is designed with generalization as a core principle. OmniGlue leverages broad knowledge from a vision foundation model to guide the feature matching process, boosting generalization to domains not seen at training time. Additionally, we propose a novel keypoint position-guided attention mechanism which disentangles spatial and appearance information, leading to enhanced matching descriptors. We perform comprehensive experiments on a suite of $7$ datasets with varied image domains, including scene-level, object-centric and aerial images. OmniGlue's novel components lead to relative gains on unseen domains of $20.9\%$ with respect to a directly comparable reference model, while also outperforming the recent LightGlue method by $9.5\%$ relatively. Code and model can be found at https://hwjiang1510.github.io/OmniGlue

Paper Structure (16 sections, 1 equation, 7 figures, 7 tables)

Figures (7)

  • Figure 1: OmniGlue is a generalizable learnable matcher. Introducing foundation model guidance and an enhanced attention mechanism, OmniGlue learns effective image matching that transfers well to image domains not seen during training. We compare it against reference methods SIFT [Lowe2004sift] and SuperGlue [sarlin2020superglue], with substantial improvements on a suite of diverse datasets: outdoor scenes (MegaDepth-1500 [Li2018MegaDepth], pose AUC@$5^\circ$), indoor scenes (ScanNet [dai2017scannet], pose accuracy @$5^\circ$), aerial scenes (DeepAerial [park2020two], PCK@$1\%$) and object-centric images (GSO-Hard [Downs2022GoogleSO] and NAVI-MultiView / NAVI-Wild [jampani2023navi], pose accuracy @$5^\circ$).
  • Figure 2: OmniGlue overview. We use frozen DINO and SuperPoint to detect keypoints and extract features. Then, we build densely connected intra-image keypoint graphs and leverage DINO features to build inter-image graphs. We refine the keypoint features based on the constructed graphs, performing information propagation. In this process, we use keypoint positions solely for guidance, disentangling them from the keypoint local descriptors. Finally, the matching results are produced based on the updated keypoint local descriptors.
  • Figure 3: (Left) Building inter-image graph. We prune the dense pairwise graph based on the DINO feature similarity. (Right) Position-guided attention. The keypoint position is involved in computing attention weights, while the output attention update is only composed of local descriptor components.
  • Figure 4: Visualization of correspondences predicted by OmniGlue on the MegaDepth-1500 benchmark. We distinguish the matches by different colors. We show results for scenes "0022" and "0015" on the top and bottom rows, respectively.
  • Figure 5: Zero-shot generalization to novel domains. The top and middle rows show results on GSO and NAVI; the last row shows results on ScanNet and DeepAerial. We draw the correct and incorrect estimated correspondences in green and red, respectively.
  • ...and 2 more figures
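
The two mechanisms described in the Figure 3 caption can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes cosine similarity between DINO features, a hypothetical `keep_ratio` parameter for the pruning step, identity projections in place of learned query/key/value weights, and a residual descriptor update. The function names are illustrative only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def prune_inter_image_graph(dino_a, dino_b, keep_ratio=0.5):
    """For each keypoint in image A, keep only the keypoints in image B with
    the highest DINO feature similarity. keep_ratio is an assumed parameter."""
    k = max(1, int(len(dino_b) * keep_ratio))
    graph = []
    for feat_a in dino_a:
        sims = [cosine(feat_a, feat_b) for feat_b in dino_b]
        ranked = sorted(range(len(dino_b)), key=lambda j: sims[j], reverse=True)
        graph.append(sorted(ranked[:k]))  # indices of retained neighbors in B
    return graph


def position_guided_attention(desc, pos_enc, neighbors):
    """Intra-image attention sketch: positions enter the queries and keys
    (and hence the attention weights), but values are descriptors only, so
    the update contains no positional component. Learned projections are
    replaced by the identity here for brevity (an assumption)."""
    updated = []
    for i, d_i in enumerate(desc):
        query = [a + b for a, b in zip(d_i, pos_enc[i])]
        scale = math.sqrt(len(query))
        logits = []
        for j in neighbors[i]:
            key = [a + b for a, b in zip(desc[j], pos_enc[j])]
            logits.append(sum(x * y for x, y in zip(query, key)) / scale)
        # Numerically stable softmax over the neighbors of keypoint i.
        top = max(logits)
        weights = [math.exp(l - top) for l in logits]
        total = sum(weights)
        # Values are the raw descriptors: no positional encoding is mixed in.
        update = [
            sum(w / total * desc[j][c] for w, j in zip(weights, neighbors[i]))
            for c in range(len(d_i))
        ]
        updated.append([a + u for a, u in zip(d_i, update)])  # residual update
    return updated
```

In this sketch, `prune_inter_image_graph` returns, for each keypoint in the first image, the indices of the retained keypoints in the second image. `position_guided_attention` takes a neighbor list per keypoint, which for a densely connected intra-image graph is simply every keypoint index.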