Render-and-Compare: Cross-View 6 DoF Localization from Noisy Prior

Shen Yan; Xiaoya Cheng; Yuxiang Liu; Juelin Zhu; Rouwan Wu; Yu Liu; Maojun Zhang

TL;DR

This work tackles cross-view visual localization for 6-DoF pose estimation by bridging aerial oblique references with ground-level queries. It introduces a three-stage Render-and-Compare pipeline that starts from noisy sensor priors, augments them with additional seeds, renders synthetic views from a textured aerial mesh, and iteratively refines the pose by lifting 2D-2D matches to 2D-3D correspondences and solving with PnP-LO-RANSAC. A new AirLoc dataset provides textured 3D aerial maps, ground-truth 6-DoF poses, and day-night variations for benchmarking cross-view localization. The method outperforms state-of-the-art baselines on AirLoc by a large margin, which supports render-aligned matching over traditional real-to-real matching under large viewpoint changes; ablations confirm the benefits of multiple iterations and of seed augmentation for noisy priors.
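The step from 2D-2D to 2D-3D correspondences can be illustrated with a minimal sketch. It assumes a pinhole camera, a rendered depth map that stores z-depth per pixel (with non-positive values meaning no surface), and a camera-to-world pose for the rendered view. These conventions, and the names `backproject` and `lift_matches`, are illustrative assumptions rather than the authors' implementation.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Pixel = Tuple[float, float]


def backproject(pixel: Pixel, depth: float, fx: float, fy: float,
                cx: float, cy: float,
                R_c2w: Sequence[Sequence[float]], t_c2w: Vec3) -> Vec3:
    """Lift one rendered-image pixel with known z-depth to a 3D world point."""
    u, v = pixel
    # Pinhole back-projection into the camera frame of the rendered view.
    p_c = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)
    # Assumed camera-to-world convention: X_w = R * X_c + t.
    return tuple(
        sum(R_c2w[i][j] * p_c[j] for j in range(3)) + t_c2w[i]
        for i in range(3)
    )


def lift_matches(matches: List[Tuple[Pixel, Pixel]],
                 depth_map: Sequence[Sequence[float]],
                 fx: float, fy: float, cx: float, cy: float,
                 R_c2w: Sequence[Sequence[float]],
                 t_c2w: Vec3) -> List[Tuple[Pixel, Vec3]]:
    """Turn 2D-2D matches (query pixel, rendered pixel) into 2D-3D pairs."""
    pairs = []
    for q_px, r_px in matches:
        col, row = int(round(r_px[0])), int(round(r_px[1]))
        # Skip matches that fall outside the rendered depth map.
        if not (0 <= row < len(depth_map) and 0 <= col < len(depth_map[0])):
            continue
        d = depth_map[row][col]
        # Skip pixels where no mesh surface was rendered.
        if d <= 0.0:
            continue
        pairs.append((q_px, backproject(r_px, d, fx, fy, cx, cy, R_c2w, t_c2w)))
    return pairs
```

The resulting pairs of query pixels and 3D points are the kind of input a PnP solver inside a RANSAC loop consumes; the solver itself is left out of this sketch.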

Abstract

Despite significant progress in 6-DoF visual localization, research is mostly driven by ground-level benchmarks. Compared with aerial oblique photography, ground-level map collection lacks scalability and complete coverage. In this work, we propose to go beyond the traditional ground-level setting and explore cross-view localization from aerial to ground. We solve this problem by formulating camera pose estimation as an iterative render-and-compare pipeline and by improving robustness through augmenting seeds from noisy initial priors. As no public dataset exists for the studied problem, we collect a new dataset that provides a variety of cross-view images from smartphones and drones, and we develop a semi-automatic system to acquire ground-truth poses for query images. We benchmark our method as well as several state-of-the-art baselines and demonstrate that our method outperforms other approaches by a large margin.

Paper Structure (29 sections, 4 equations, 12 figures, 4 tables)

Introduction
Related Works
Structured localization
Cross-view Geo-localization
Synthesis localization
Localization Datasets
Method
Prior Pose Generation
View Synthesis
Pose Correction
Dataset
Reference Map Collection
Query Image Collection
Query GT Generation
Experiment
...and 14 more sections

Figures (12)

Figure 1: Cross-view 6-DoF localization. The proposed benchmark dataset AirLoc exhibits drastic view changes between query and reference. The reference map is captured with a pentacular oblique camera from above 100 meters, while the query images are captured close to the ground with small drones and smartphones. Query images cover real day and night environments.
Figure 2: Overview of the proposed method. (a) For each prior pose ${}_s\boldsymbol{\mathcal{\xi}}^q$, we first mitigate the noise by sampling several random seeds $\{{}_s\boldsymbol{\mathcal{\xi}}^q_1, \ldots, {}_s\boldsymbol{\mathcal{\xi}}^q_k\}$. We then select the seed with the maximum number of inlier matches as $\boldsymbol{\mathcal{\xi}}^q_{t_1}$. The virtual pose is then updated by Render-and-Compare toward the GT target ${}^*\boldsymbol{\mathcal{\xi}}^q$, passing from the intermediate $\boldsymbol{\mathcal{\xi}}^q_{t_i}$ at step $i$ to the final $\boldsymbol{\mathcal{\xi}}^q_{t_h}$ at step $h$. (b, c, d) Feature correspondences between the query and the rendered image during iterative refinement, where warmer colors indicate higher confidence. The matches improve markedly over the successive adjustments.
Figure 3: Rendering result visualization. An example of the rendering results: (a) shows the synthesized view, while (b) shows the depth map.
Figure 4: Alignment quality of the aerial-to-ground reconstruction on AirLoc. The dark model comes from aerial oblique photography, while the yellow model is built from a sequence of ground-level cellphone photos. The accuracy of the alignment can be observed in, for instance, the agreement of corners and edges.
Figure 5: GT pose quality on AirLoc. Pixel-aligned renderings at the estimated camera poses confirm that the poses are sufficiently accurate for our evaluation.
...and 7 more figures
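
The seed augmentation and selection described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions: the pose is reduced to position plus yaw, the perturbations are Gaussian, and `count_inliers` and `step` are hypothetical callables standing in for "render at this pose, match against the query, count inliers" and "one render, match, and pose-solve update". None of these choices are confirmed by the source.

```python
import math
import random
from typing import Callable, List, Tuple

# Simplified pose for illustration only: (x, y, z, yaw in radians).
Pose = Tuple[float, float, float, float]


def augment_seeds(prior: Pose, k: int, pos_sigma: float, yaw_sigma: float,
                  rng: random.Random) -> List[Pose]:
    """Sample k perturbed copies of a noisy prior pose (Gaussian noise assumed)."""
    x, y, z, yaw = prior
    seeds = []
    for _ in range(k):
        noisy_yaw = yaw + rng.gauss(0.0, yaw_sigma)
        seeds.append((
            x + rng.gauss(0.0, pos_sigma),
            y + rng.gauss(0.0, pos_sigma),
            z + rng.gauss(0.0, pos_sigma),
            # Wrap the perturbed yaw into [-pi, pi).
            (noisy_yaw + math.pi) % (2.0 * math.pi) - math.pi,
        ))
    return seeds


def select_seed(seeds: List[Pose],
                count_inliers: Callable[[Pose], int]) -> Pose:
    """Return the seed whose rendered view yields the most inlier matches."""
    return max(seeds, key=count_inliers)


def refine(pose, step: Callable, h: int):
    """Apply h sequential update steps, starting from the selected seed."""
    for _ in range(h):
        pose = step(pose)
    return pose
```

In this sketch the selected seed plays the role of the pose at step $t_1$, and `refine` carries it through the remaining updates up to step $t_h$; whether the original prior is kept among the seeds is left open here.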
