LightGlue: Local Feature Matching at Light Speed

Philipp Lindenberger, Paul-Edouard Sarlin, Marc Pollefeys

TL;DR

LightGlue tackles the efficiency gap in deep sparse feature matching by introducing an adaptive Transformer-based matcher that can halt computation early for easy image pairs. It replaces the heavy Sinkhorn-based optimization of previous work with a lightweight, per-layer correspondence head and a confidence-driven exit mechanism, while using relative positional encodings and bidirectional attention to maintain accuracy. Through synthetic homography pretraining and MegaDepth finetuning, LightGlue achieves state-of-the-art or competitive results with substantially reduced runtime, and ablations highlight the critical role of matchability, adaptivity, and deep supervision. The method is demonstrated across HPatches, MegaDepth, and Aachen Day-Night, showing strong performance for SLAM-scale localization and visual reconstruction tasks, with code publicly available.
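
The lightweight correspondence head mentioned above can be illustrated with a small sketch. It assumes the head combines a pairwise similarity matrix, normalized with a softmax along each direction, with one matchability score per point, and then keeps mutual best pairs above a threshold. The function names, the toy scores, and the threshold of 0.1 are illustrative assumptions; this is not the authors' implementation.

```python
import math

def softmax(values):
    """Numerically stable softmax over a sequence of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def soft_assignment(sim, match_a, match_b):
    """Combine pairwise similarity with unary matchability.

    sim[i][j]  : similarity between point i in image A and point j in image B
    match_a[i] : matchability of point i in A, in [0, 1]
    match_b[j] : matchability of point j in B, in [0, 1]
    """
    rows = [softmax(row) for row in sim]              # normalize over B for each i
    cols = [softmax(list(col)) for col in zip(*sim)]  # normalize over A for each j
    return [
        [match_a[i] * match_b[j] * rows[i][j] * cols[j][i]
         for j in range(len(match_b))]
        for i in range(len(match_a))
    ]

def mutual_matches(assign, threshold=0.1):
    """Keep pairs that are each other's best candidate and exceed the threshold."""
    matches = []
    for i, row in enumerate(assign):
        j = max(range(len(row)), key=row.__getitem__)
        best_i = max(range(len(assign)), key=lambda k: assign[k][j])
        if best_i == i and row[j] > threshold:
            matches.append((i, j))
    return matches

# Toy scores (hypothetical): 3 points in A, 2 in B; point 2 in A has no counterpart.
sim = [[8.0, 0.0], [0.0, 8.0], [1.0, 1.0]]
match_a = [0.9, 0.9, 0.05]
match_b = [0.9, 0.9]
matches = mutual_matches(soft_assignment(sim, match_a, match_b))
```

Because every step is a closed-form normalization, the head is cheap enough to evaluate at any layer, unlike an iterative Sinkhorn solver. In the toy example the low matchability of the third point in A keeps it out of the result.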

Abstract

We introduce LightGlue, a deep neural network that learns to match local features across images. We revisit multiple design decisions of SuperGlue, the state of the art in sparse matching, and derive simple but effective improvements. Cumulatively, they make LightGlue more efficient (in terms of both memory and computation), more accurate, and much easier to train. One key property is that LightGlue is adaptive to the difficulty of the problem: the inference is much faster on image pairs that are intuitively easy to match, for example because of a larger visual overlap or limited appearance change. This opens up exciting prospects for deploying deep matchers in latency-sensitive applications like 3D reconstruction. The code and trained models are publicly available at https://github.com/cvg/LightGlue.

Paper Structure

This paper contains 24 sections, 13 equations, 13 figures, and 11 tables.

Figures (13)

  • Figure 1: LightGlue matches sparse features faster and better than existing approaches like SuperGlue. Its adaptive stopping mechanism gives fine-grained control over the speed vs. accuracy trade-off. Our final, optimized model $\star$ delivers an accuracy closer to the dense matcher LoFTR at an 8$\times$ higher speed, here in typical outdoor conditions.
  • Figure 2: Depth adaptivity. LightGlue is faster at matching easy image pairs (top) than difficult ones (bottom) because it can stop at earlier layers when its predictions are confident.
  • Figure 3: The LightGlue architecture. Given a pair of input local features ($\bm{\mathrm{d}},\bm{\mathrm{p}}$), each layer augments the visual descriptors ($\color{red}\bullet$,$\color{blue}\bullet$) with context based on self- and cross-attention units with positional encoding $\odot$. A confidence classifier $c$ helps decide whether to stop the inference. If few points are confident, the inference proceeds to the next layer but we prune points that are confidently unmatchable. Once a confident state is reached, LightGlue predicts an assignment between points based on their pairwise similarity and unary matchability.
  • Figure 4: Point pruning. As LightGlue aggregates context, it can find out early that some points ($\color{red}\bullet$) are unmatchable and thus exclude them from subsequent layers. Other, non-repeatable points are excluded in later layers: ${\color{orange}\bullet} \rightarrow {\color{yellow}\bullet}\rightarrow {\color{green}\bullet}$. This reduces the inference time and the search space ($\color{blue}\bullet$) to ultimately find good matches fast.
  • Figure 5: Ease of training. The LightGlue architecture vastly improves the speed of convergence of the pre-training on synthetic homographies. After 5M image pairs (only 2 GPU-days), LightGlue achieves -33% loss at the final layer and +4% match recall. SuperGlue requires over 7 days of training to reach a similar accuracy.
  • ...and 8 more figures
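
The control flow described in the captions of Figures 2 to 4 (check confidence after each layer, stop early if enough points are confident, otherwise prune confidently unmatchable points and continue) can be sketched as follows. The layers here are stand-ins that return fixed scores rather than real attention layers, and the names `run_adaptive` and `fixed_layer` as well as all threshold values are assumptions made for illustration only.

```python
def fraction_confident(confidences, threshold):
    """Share of points whose predicted confidence exceeds the threshold."""
    if not confidences:
        return 1.0
    return sum(c > threshold for c in confidences) / len(confidences)

def run_adaptive(layers, points, exit_ratio=0.95, conf_threshold=0.9,
                 unmatchable_threshold=0.01):
    """Run layers until enough points are confident, pruning along the way.

    layers : list of callables; each maps the list of active point ids to
             (confidences, matchabilities), two dicts keyed by point id.
             This is a stand-in for an attention layer plus classifier heads.
    Returns (number of layers run, ids of the points still active).
    """
    active = list(points)
    for depth, layer in enumerate(layers, start=1):
        conf, matchability = layer(active)
        # Early exit: stop once a large enough share of points is confident.
        if fraction_confident([conf[p] for p in active], conf_threshold) > exit_ratio:
            return depth, active
        # Otherwise drop points that are confidently unmatchable.
        active = [p for p in active
                  if not (conf[p] > conf_threshold
                          and matchability[p] < unmatchable_threshold)]
    return len(layers), active

def fixed_layer(conf, matchability):
    """Hypothetical layer that ignores its input and returns fixed scores."""
    return lambda active: (conf, matchability)

# Toy run: point 2 is pruned after layer 1, and inference stops at layer 2.
layers = [
    fixed_layer({0: 0.5, 1: 0.6, 2: 0.95, 3: 0.4},
                {0: 0.8, 1: 0.7, 2: 0.001, 3: 0.5}),
    fixed_layer({0: 0.97, 1: 0.96, 3: 0.99},
                {0: 0.9, 1: 0.9, 3: 0.8}),
    fixed_layer({0: 0.99, 1: 0.99, 3: 0.99},
                {0: 0.9, 1: 0.9, 3: 0.8}),
]
depth, active = run_adaptive(layers, points=[0, 1, 2, 3])
```

In this toy run the third layer is never evaluated, which is the source of the speed-up on easy pairs: both the number of layers and the number of points processed per layer shrink as confidence grows.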