Efficient LoFTR: Semi-Dense Local Feature Matching with Sparse-Like Speed

Yifan Wang, Xingyi He, Sida Peng, Dongli Tan, Xiaowei Zhou

TL;DR

Efficient LoFTR tackles the efficiency gap in detector-free semi-dense image matching by introducing an Aggregated Attention Module that reduces token counts and a Two-Stage Correlation Refinement for robust subpixel accuracy. The method preserves or improves accuracy while running roughly $2.5\times$ faster than LoFTR, and its efficiency-optimized variant is also faster than the sparse SuperPoint + LightGlue pipeline. It reports strong results across relative pose, homography, and visual localization benchmarks, which supports its use in large-scale and latency-sensitive tasks. By rethinking local attention and refinement strategies, the approach enables robust, fast matching under challenging conditions such as large viewpoint changes and texture-poor regions.

Abstract

We present a novel method for efficiently producing semi-dense matches across images. The previous detector-free matcher LoFTR has shown remarkable matching capability in handling large viewpoint changes and texture-poor scenarios but suffers from low efficiency. We revisit its design choices and derive multiple improvements for both efficiency and accuracy. One key observation is that performing the transformer over the entire feature map is redundant due to shared local information; we therefore propose an aggregated attention mechanism with adaptive token selection for efficiency. Furthermore, we find that spatial variance exists in LoFTR's fine correlation module, which is adverse to matching accuracy. A novel two-stage correlation layer is proposed to achieve accurate subpixel correspondences. Our efficiency-optimized model is $\sim 2.5\times$ faster than LoFTR and can even surpass the state-of-the-art efficient sparse matching pipeline SuperPoint + LightGlue. Moreover, extensive experiments show that our method achieves higher accuracy than competitive semi-dense matchers, with considerable efficiency benefits. This opens up exciting prospects for large-scale or latency-sensitive applications such as image retrieval and 3D reconstruction. Project page: https://zju3dv.github.io/efficientloftr.
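
To make the redundancy argument concrete, the following minimal sketch counts tokens and query-key pairs for vanilla attention on a single feature map, before and after local aggregation. The function name `attention_cost`, the 60 x 80 feature-map size, and the aggregation window `stride=4` are illustrative assumptions, not values taken from the text above. The pair count is only a rough proxy for cost: LoFTR itself uses linear attention, so the ratio printed here is not the measured speedup.

```python
def attention_cost(height, width, stride):
    """Count tokens and query-key pairs for vanilla attention on one feature map.

    `stride` is a hypothetical aggregation window: each stride x stride block
    of tokens is pooled into a single token before attention is computed.
    """
    tokens = (height // stride) * (width // stride)
    return tokens, tokens * tokens


# Assumed example: a 60 x 80 coarse feature map.
full_tokens, full_pairs = attention_cost(60, 80, stride=1)  # no aggregation
agg_tokens, agg_pairs = attention_cost(60, 80, stride=4)    # 4 x 4 aggregation

print(full_tokens, agg_tokens)   # 4800 300
print(full_pairs // agg_pairs)   # 256
```

Under these assumptions, aggregating over a window of side `s` cuts the token count by `s**2` and the number of attended pairs by `s**4`, which is why vanilla attention becomes affordable on the reduced token set.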

Paper Structure

This paper contains 38 sections, 7 equations, 5 figures, 14 tables.

Figures (5)

  • Figure 1: Matching Accuracy and Efficiency Comparisons. Our method achieves competitive accuracy compared with semi-dense matchers at a significantly higher speed. Compared with the dense matcher ROMA, our method is $\sim 7.5\times$ faster. Moreover, our efficiency-optimized model can surpass the robust sparse matching pipeline SuperPoint (SP) + LightGlue (LG) on efficiency with considerably better accuracy.
  • Figure 2: Pipeline Overview. (1) Given an image pair, a CNN extracts coarse feature maps $\tilde{\textbf{F}}_A$ and $\tilde{\textbf{F}}_B$, as well as fine features. (2) The coarse features are then transformed into more discriminative feature maps by interleaving our aggregated self- and cross-attention $N$ times, where adaptive feature aggregation is performed before each attention to reduce the number of tokens for efficiency. (3) The transformed coarse features are correlated to form the score matrix $\mathcal{S}$, and a mutual-nearest-neighbor (MNN) search then establishes the coarse matches $\{\mathcal{M}_c\}$. (4) To refine the coarse matches, discriminative fine features $\hat{\textbf{F}}_A^t$, $\hat{\textbf{F}}_B^t$ in full resolution are obtained by fusing the transformed coarse features $\tilde{\textbf{F}}_A^t$, $\tilde{\textbf{F}}_B^t$ with backbone features. Feature patches are then cropped, centered at each coarse match $\mathcal{M}_c$, and a two-stage refinement yields the sub-pixel correspondences $\mathcal{M}_f$.
  • Figure 3: Detailed Transformer Module Comparison. Unlike LoFTR, which uses all tokens of the feature maps to compute attention and resorts to linear attention to reduce the computational cost, the proposed attention module first aggregates features into salient tokens, which makes attention significantly more efficient. Vanilla attention is then used to transform the aggregated features, with relative positional encoding inserted to capture spatial information. The transformed features are upsampled and fused with the original features to form the final features.
  • Figure 4: Qualitative Results. Our method is compared with the sparse matching pipeline SuperPoint [DeTone2017SuperPointSI] + LightGlue [lindenberger2023lightglue] and the semi-dense matcher AspanFormer [chen2022aspanformer]. Image pairs with texture-poor regions and large viewpoint changes can be robustly matched by our method. Red indicates an epipolar error beyond $5 \times 10^{-4}$ (in normalized image coordinates).
  • Figure 5: Qualitative Results. Our method is compared with the sparse matching pipeline SuperPoint [DeTone2017SuperPointSI] + LightGlue [lindenberger2023lightglue] and the semi-dense matcher AspanFormer [chen2022aspanformer]. Red indicates an epipolar error beyond $5 \times 10^{-4}$ on ScanNet and $1 \times 10^{-4}$ on MegaDepth (in normalized image coordinates). Since no ground-truth pose is available on the InLoc dataset, matches there are colored by predicted confidence: red indicates higher confidence and blue lower confidence.
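
The coarse matching step described in the Figure 2 caption (a score matrix followed by a mutual-nearest-neighbor search) can be sketched in plain Python as below. This is only an illustration under stated assumptions: the function name `mutual_nearest_neighbors`, the nested-list representation of the score matrix, and the optional `threshold` parameter are hypothetical and not taken from the paper or its code.

```python
def mutual_nearest_neighbors(scores, threshold=0.0):
    """Return (i, j) pairs that are each other's best match in `scores`.

    `scores[i][j]` is the similarity between token i of image A and token j
    of image B. `threshold` is a hypothetical minimum score for keeping a match.
    """
    if not scores or not scores[0]:
        return []
    n_rows, n_cols = len(scores), len(scores[0])
    # Best column for every row, and best row for every column.
    best_col = [max(range(n_cols), key=lambda j: scores[i][j]) for i in range(n_rows)]
    best_row = [max(range(n_rows), key=lambda i: scores[i][j]) for j in range(n_cols)]
    # Keep a pair only if the choice is mutual and the score is high enough.
    return [
        (i, j)
        for i, j in enumerate(best_col)
        if best_row[j] == i and scores[i][j] >= threshold
    ]


# Toy 3 x 3 score matrix (made-up values for illustration).
scores = [
    [0.9, 0.1, 0.0],
    [0.2, 0.3, 0.8],
    [0.7, 0.6, 0.1],
]
matches = mutual_nearest_neighbors(scores, threshold=0.5)
print(matches)  # [(0, 0), (1, 2)]
```

In the toy example, row 2 prefers column 0, but column 0 prefers row 0, so the pair (2, 0) is rejected. This mutual check is what keeps each token in at most one coarse match.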