Learning to Produce Semi-dense Correspondences for Visual Localization

Khang Truong Giang; Soohwan Song; Sungho Jo

TL;DR

This study proposes a novel localization method that uses a point inference network to extract reliable semi-dense 2D-3D matching points from dense keypoint matches, achieving competitive results on large-scale visual localization benchmarks.

Abstract

This study addresses the challenge of visual localization under demanding conditions such as night-time scenes, adverse weather, and seasonal changes. Many prior studies have focused on improving image matching to obtain reliable dense keypoint matches between images, but existing methods often rely heavily on predefined feature points in a reconstructed 3D model. As a result, keypoints that are not observed in the model are overlooked during matching, so dense keypoint matches are not fully exploited and accuracy drops notably, particularly in noisy scenes. To tackle this issue, we propose a novel localization method that extracts reliable semi-dense 2D-3D matching points from dense keypoint matches. The approach regresses semi-dense 2D keypoints into 3D scene coordinates using a point inference network, which uses both geometric and visual cues to infer 3D coordinates for unobserved keypoints from the observed ones. The resulting abundance of matching information significantly improves the accuracy of camera pose estimation, even with noisy or sparse 3D models. Comprehensive evaluations demonstrate that the proposed method outperforms other methods in challenging scenes and achieves competitive results on large-scale visual localization benchmarks. The code will be available.
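
To make the idea of mapping a 2D keypoint to a 3D scene coordinate concrete, here is a minimal sketch of standard pinhole back-projection: a pixel in a reference image, paired with a depth value, is lifted into world coordinates using that image's intrinsics and pose. This is textbook geometry, not the authors' network; the function name `backproject`, the world-to-camera pose convention, and all parameter names are assumptions made for illustration.

```python
def backproject(u, v, depth, fx, fy, cx, cy, R, t):
    """Lift pixel (u, v) with a depth value to world coordinates.

    Assumed convention (hypothetical, not taken from the paper):
    R (3x3 nested lists) and t (length-3 list) map world -> camera,
    i.e. x_cam = R @ x_world + t.
    """
    # Pixel -> camera frame with the pinhole model.
    x_cam = [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]
    # Camera -> world: x_world = R^T @ (x_cam - t).
    d = [x_cam[i] - t[i] for i in range(3)]
    return [sum(R[k][i] * d[k] for k in range(3)) for i in range(3)]


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# A keypoint at the principal point, 2 units in front of the camera.
point = backproject(320.0, 240.0, 2.0, 500.0, 500.0, 320.0, 240.0,
                    identity, [0.0, 0.0, 0.0])
```

In this picture, a keypoint with no observed 3D point still needs a depth (or a direct 3D estimate) from somewhere; the abstract describes a learned network filling that gap from the observed keypoints.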

Paper Structure (23 sections, 10 equations, 9 figures, 6 tables)

Introduction
Related Works
Proposed Method
Overview
Point Inference Network
Confidence-based Point Aggregation
Loss Functions
Experiments
Implementation Details
Evaluation on Cambridge and 7scenes
Evaluation on large-scale challenging scenes
Ablation Study
Conclusions and Limitations
Acknowledgments
Method Details
...and 8 more sections

Figures (9)

Figure 1: Comparison of the 2D-3D correspondence finding process in our method (DeViLoc) and an existing method (HLoc \cite{sarlin2019coarse}). HLoc heavily relies on a robust 3D point cloud but discards many detected 2D keypoints (depicted in green) during the 2D-3D matching process. In contrast, our method efficiently handles a noisy point cloud through the point inference process of PIN. PIN transforms the entire set of 2D-2D matches into 2D-3D matches. Our method then produces numerous accurate 2D-3D matches across multiple views using a confidence-based aggregation module. These abundant matches significantly enhance localization performance, particularly in scenarios characterized by noisy or sparse 3D point clouds.
Figure 2: Overview of DeViLoc. First, a feature matcher is employed to detect 2D-2D matches for each pair of query-reference images. Subsequently, the PIN module infers a set of 3D coordinates for all detected 2D keypoints based on the observed data in the reference image. Finally, the CPA module integrates all 2D-3D matches obtained across all query-reference pairs.
Figure 3: Point Inference Network (PIN). The network begins by learning embeddings for all keypoints $(K_{emb}^o, K_{emb}^r)$ and observed depths $(D_{emb}^o)$. Subsequently, attention layers are employed for both geometric and visual guidance. Finally, the learned latent codes $(P_{lc}^r)$ are utilized to perform regression for the 3D points along with confidence values.
Figure 4: Comparison between point clouds built from traditional FM (SIFT \cite{lowe2004distinctive}), sparse FM (SP+SG \cite{detone2018superpoint, sarlin2020superglue}), and detector-free FM (LoFTR \cite{sun2021loftr}). DeViLoc handles the noisy SIFT-based input well, achieving performance competitive with the precise (SP+SG) or dense (LoFTR) inputs (shown in Table \ref{table:ablation_pointclouds}).
Figure 5: Illustration of 2D-3D correspondences estimated by DeViLoc for several pairs of images. The observed 2D keypoints are marked in black, while the reference keypoints are represented in orange (low confidence) or green (high confidence).
...and 4 more figures
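
The captions for Figures 1 and 2 describe a confidence-based aggregation step that merges 2D-3D matches from all query-reference pairs. One simple way such a merge could work is sketched below: matches that land on the same query pixel are grouped, low-confidence candidates are dropped, and the survivors are combined by a confidence-weighted mean. The grouping by rounded pixel, the threshold `min_conf`, and the function name `aggregate_matches` are all assumptions for illustration; the captions do not specify the actual CPA procedure.

```python
from collections import defaultdict


def aggregate_matches(matches, min_conf=0.5):
    """Merge 2D-3D matches gathered from several query-reference pairs.

    matches: iterable of ((u, v), (x, y, z), conf) tuples, where (u, v) is a
    query keypoint, (x, y, z) a candidate 3D point, and conf its confidence.
    Returns {(u, v) rounded to ints: ((x, y, z), best_conf)}.
    """
    groups = defaultdict(list)
    for (u, v), xyz, conf in matches:
        # Hypothetical filtering rule: drop candidates below the threshold.
        if conf < min_conf or conf <= 0:
            continue
        groups[(round(u), round(v))].append((xyz, conf))

    merged = {}
    for pixel, cands in groups.items():
        total = sum(c for _, c in cands)
        # Confidence-weighted mean of the surviving 3D candidates.
        point = tuple(sum(p[i] * c for p, c in cands) / total for i in range(3))
        merged[pixel] = (point, max(c for _, c in cands))
    return merged


example = [
    ((10.2, 20.1), (1.0, 0.0, 0.0), 0.9),    # view A
    ((9.8, 19.9), (3.0, 0.0, 0.0), 0.9),     # view B, same query pixel
    ((10.0, 20.0), (100.0, 0.0, 0.0), 0.1),  # low-confidence outlier, dropped
    ((50.0, 60.0), (0.0, 1.0, 0.0), 0.4),    # below threshold, dropped
]
result = aggregate_matches(example)
```

The merged 2D-3D matches would then typically be passed to a robust pose solver such as PnP with RANSAC, which is a common choice in localization pipelines but is not stated in the text above.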