ETO: Efficient Transformer-based Local Feature Matching by Organizing Multiple Homography Hypotheses

Junjie Ni, Guofeng Zhang, Guanglin Li, Yijin Li, Xinyang Liu, Zhaoyang Huang, Hujun Bao

TL;DR

ETO is an efficient transformer-based local feature matching framework. It organizes multiple homography hypotheses to cover planar scene structure and refines matches with a single, uni-directional cross-attention step. This reduces the number of patch tokens fed to the transformer and accelerates refinement while preserving matching accuracy: inference is roughly 4x faster than LoFTR on YFCC100M, and performance remains competitive on MegaDepth, YFCC100M, ScanNet, and HPatches. Key contributions include (i) a hypothesis-based coarse matching strategy that segments patches by local homographies, (ii) a segmentation-guided re-selection mechanism, and (iii) a streamlined refinement stage with uni-directional attention. The reduced computational cost makes the method practical for downstream tasks such as SLAM, 3D reconstruction, and visual localization.

Abstract

We tackle the efficiency problem of learned local feature matching. Recent advancements have given rise to purely CNN-based and transformer-based approaches, each augmented with deep learning techniques. While CNN-based methods often excel in matching speed, transformer-based methods tend to provide more accurate matches. We propose an efficient transformer-based network architecture for local feature matching. It is built on two ideas: constructing multiple homography hypotheses to approximate the continuous correspondence in the real world, and using uni-directional cross-attention to accelerate refinement. On the YFCC100M dataset, our matching accuracy is competitive with LoFTR, a state-of-the-art transformer-based architecture, while inference is four times faster, even outperforming the CNN-based methods. Comprehensive evaluations on other open datasets such as MegaDepth, ScanNet, and HPatches demonstrate our method's efficacy, highlighting its potential to significantly enhance a wide array of downstream applications.

Paper Structure

This paper contains 15 sections, 8 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: ETO goes beyond the Pareto curve between accuracy and efficiency. This figure shows the performance of different state-of-the-art methods on YFCC100M. The reported time includes feature extraction and description. LightGlue and LightGlue* are different settings of LightGlue.
  • Figure 2: The two red regions on the sphere correspond to each other. Compared with uniform hypotheses, homography hypotheses approximate the correspondence function more closely, which allows more precise matching results with fewer computational resources.
  • Figure 3: Given the source image $\mathcal{S}$ and target image $\mathcal{T}$, we first use a U-Net-like feature extractor to obtain feature maps at three resolutions: $M_1$ ($H/32 \times W/32$), $M_2$ ($H/8 \times W/8$), and $M_3$ ($H/2 \times W/2$). We use local $3\times 3$ patches to illustrate our method: (a) We estimate a homography hypothesis $H_i$ for every feature after applying the transformer. (b) We segment the map according to these hypotheses to minimize projection errors. With the applied homography matrix $\hat{H}_j$, we can project the chosen source point $P_j^s$ to the target point $P_j^t$. (c) We update $P_j^t$ to $P_j^{t*}$ after a uni-directional cross-attention. The training process is split into two parts, the coarse and the fine. We train the coarse part with $L_H$ and the fine part with $L_s$ and $L_r$.
  • Figure 4: Each unit $j$ on $M_2$ should be assigned to the hypothesis in $\mathcal{H}$ that minimizes its projection error. Each $H_i$ describes a plane.
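
The projection and hypothesis-assignment steps described in the captions of Figures 3 and 4 can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes reference target locations are available for each source point (for example, ground-truth correspondences during training), and the function names `project` and `assign_hypotheses` are hypothetical. How ETO makes this assignment at inference time is not described in this section.

```python
import numpy as np

def project(H, pts):
    """Apply a 3x3 homography H to an (N, 2) array of points."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])  # to homogeneous coordinates
    out = pts_h @ H.T                                 # (N, 3)
    return out[:, :2] / out[:, 2:3]                   # back to Cartesian coordinates

def assign_hypotheses(hyps, src_pts, tgt_pts):
    """Pick, for each point, the hypothesis with the smallest projection error.

    hyps:    (K, 3, 3) candidate homographies (illustrative stand-ins for H_i)
    src_pts: (N, 2) source points
    tgt_pts: (N, 2) reference target points (assumed to be known here)
    Returns the chosen hypothesis index per point and the matching error.
    """
    proj = np.stack([project(H, src_pts) for H in hyps])  # (K, N, 2)
    err = np.linalg.norm(proj - tgt_pts[None], axis=-1)   # (K, N)
    labels = err.argmin(axis=0)                           # (N,)
    return labels, err[labels, np.arange(len(src_pts))]

# Toy example: hypothesis 0 is the identity, hypothesis 1 shifts x by 10.
hyps = np.array([
    np.eye(3),
    [[1.0, 0.0, 10.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
])
src = np.array([[0.0, 0.0], [5.0, 5.0]])
tgt = np.array([[0.0, 0.0], [15.0, 5.0]])
labels, errors = assign_hypotheses(hyps, src, tgt)
```

In this toy case the first point is assigned to the identity hypothesis and the second to the shifted one, each with zero projection error.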