Deep Homography Estimation for Visual Place Recognition

Feng Lu, Shuting Dong, Lijun Zhang, Bingxi Liu, Xiangyuan Lan, Dongmei Jiang, Chun Yuan

TL;DR

This paper tackles visual place recognition by replacing RANSAC-based geometric verification with a differentiable deep homography estimation (DHE) network in a two-stage VPR framework. The DHE network regresses a homography $\mathbf{H}_{qc}$ from the dense local feature maps, and the homography is then used to identify inliers among the matched local features. The network is trained with a re-projection error of inliers (REI) loss $L_r$, which enables end-to-end training with the backbone through the joint objective $L = L_g + \lambda L_r$. Key contributions include an architecture that jointly learns global retrieval and differentiable geometric verification, the REI loss that supplies supervision without explicit homography labels, and results that outperform several state-of-the-art methods while running more than one order of magnitude faster than mainstream hierarchical VPR methods that use RANSAC. The approach reduces re-ranking time and improves robustness against perceptual aliasing, making it well suited to real-time, large-scale VPR tasks.

Abstract

Visual place recognition (VPR) is a fundamental task for many applications such as robot localization and augmented reality. Recently, hierarchical VPR methods have received considerable attention because of their trade-off between accuracy and efficiency. They usually first use global features to retrieve candidate images, then verify the spatial consistency of matched local features for re-ranking. However, the latter step typically relies on the RANSAC algorithm to fit a homography, which is time-consuming and non-differentiable. As a result, existing methods settle for training the network only for global feature extraction. Here, we propose a transformer-based deep homography estimation (DHE) network that takes the dense feature map extracted by a backbone network as input and fits a homography for fast and learnable geometric verification. Moreover, we design a re-projection error of inliers loss to train the DHE network without additional homography labels. The DHE network can also be trained jointly with the backbone network to help it extract features that are more suitable for local matching. Extensive experiments on benchmark datasets show that our method can outperform several state-of-the-art methods. It is also more than one order of magnitude faster than the mainstream hierarchical VPR methods using RANSAC. The code is released at https://github.com/Lu-Feng/DHE-VPR.
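To make the geometric verification step concrete, the following is a minimal sketch of inlier counting under a given homography. It is not the authors' implementation: the function names `project` and `count_inliers`, the convention that the homography maps query coordinates to candidate coordinates, and the toy data are all assumptions made for illustration. The homography itself would come from the DHE network (or from RANSAC in a conventional pipeline); here it is simply hard-coded.

```python
from math import hypot

def project(H, pt):
    """Map a 2D point through a 3x3 homography given as nested lists."""
    x, y = pt
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return u / w, v / w  # assumes w != 0 (point not mapped to infinity)

def count_inliers(H, matches, theta):
    """Count matches (p_q, p_c) whose re-projection error is below theta.

    Assumption: H maps query coordinates p_q to candidate coordinates p_c.
    """
    inliers = 0
    for p_q, p_c in matches:
        u, v = project(H, p_q)
        if hypot(u - p_c[0], v - p_c[1]) < theta:
            inliers += 1
    return inliers

# Toy example (hypothetical data): a pure translation by (2, 1).
H = [[1.0, 0.0, 2.0],
     [0.0, 1.0, 1.0],
     [0.0, 0.0, 1.0]]
matches = [((0.0, 0.0), (2.0, 1.0)),    # exact match
           ((1.0, 1.0), (3.2, 2.1)),    # small error, still an inlier
           ((4.0, 4.0), (10.0, 10.0))]  # far off, an outlier
print(count_inliers(H, matches, theta=1.5))  # prints 2
```

The inlier count returned here plays the role of the image similarity used for re-ranking. The hard threshold is what makes plain inlier counting non-differentiable, which is why a separate training loss is needed.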

Paper Structure (17 sections, 11 equations, 5 figures, 5 tables)

Figures (5)

  • Figure 1: The two-stage place retrieval with the proposed architecture. The backbone extracts feature maps. The top branch yields global features for retrieving the top-k candidate images. The bottom branch uses the local features for cross-matching and the DHE network for geometric verification by regressing a homography. The number of inliers serves as the image similarity for re-ranking candidates.
  • Figure 2: Diagram of our re-ranking process with the DHE network. The feature maps $\boldsymbol{f}_q$ and $\boldsymbol{f}_c$ of the query image $I_q$ and a candidate image $I_c$ are fed into the Similarity Matching Module to compute the similarity map $\boldsymbol{s}_{qc}$. The Homography Regression Module then uses $\boldsymbol{s}_{qc}$ to yield the homography matrix $\mathbf{H}_{qc}$ for geometric verification of mutual matches.
  • Figure 3: Qualitative comparison of the typical solution and our method. The top is the typical solution using RANSAC, which is non-differentiable, so the backbone is trained only for global feature extraction. The bottom is ours, which yields more mutual nearest neighbors and inliers than the top. (When the re-projection threshold $\theta$ is set to 1.5× the patch size, ours yields 39 inliers, which is also more than the top.)
  • Figure 4: Qualitative results. In these challenging examples, our DHE-VPR successfully retrieves the correct images, while all other methods return wrong places. In the second example, some other methods retrieve database images that are geographically close to the query image, but their distance from it exceeds the threshold (25 m). In the last example, the buildings on the left of the query image are occluded by vehicles and vegetation.
  • Figure 5: Recall@1-Runtime comparison of two-stage VPR methods on the Pitts30k dataset.
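The captions of Figures 1 to 3 mention mutual matches between local features and re-ranking by inlier count. The sketch below illustrates both ideas on toy data. It is an assumption-laden illustration, not the paper's code: the function names `mutual_nearest_neighbors` and `rerank`, the similarity values, and the database identifiers are all hypothetical.

```python
def mutual_nearest_neighbors(sim):
    """Return index pairs (i, j) that are each other's best match.

    sim[i][j] is the similarity between query patch i and candidate patch j.
    """
    n_q, n_c = len(sim), len(sim[0])
    # Best candidate patch for every query patch (row-wise argmax).
    best_c_for_q = [max(range(n_c), key=lambda j: sim[i][j]) for i in range(n_q)]
    # Best query patch for every candidate patch (column-wise argmax).
    best_q_for_c = [max(range(n_q), key=lambda i: sim[i][j]) for j in range(n_c)]
    # Keep a pair only when the two choices agree.
    return [(i, j) for i, j in enumerate(best_c_for_q) if best_q_for_c[j] == i]

def rerank(candidates, inlier_counts):
    """Sort candidate ids by descending inlier count (ties keep their order)."""
    order = sorted(range(len(candidates)), key=lambda k: -inlier_counts[k])
    return [candidates[k] for k in order]

# Hypothetical 3x3 patch similarity matrix.
sim = [[0.9, 0.1, 0.2],
       [0.8, 0.3, 0.1],
       [0.1, 0.2, 0.7]]
print(mutual_nearest_neighbors(sim))  # prints [(0, 0), (2, 2)]

# Hypothetical top-3 candidates from global retrieval with their inlier counts.
print(rerank(["db_12", "db_7", "db_31"], [5, 39, 5]))
# prints ['db_7', 'db_12', 'db_31']
```

Query patch 1 also prefers candidate patch 0, but that patch prefers query patch 0, so the pair is dropped. Only the surviving mutual matches would be passed on to geometric verification.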