Render-and-Compare: Cross-View 6 DoF Localization from Noisy Prior

Shen Yan, Xiaoya Cheng, Yuxiang Liu, Juelin Zhu, Rouwan Wu, Yu Liu, Maojun Zhang

TL;DR

This work tackles cross-view visual localization for 6-DoF pose estimation by bridging aerial oblique references with ground-level queries. It introduces a three-stage Render-and-Compare pipeline that starts from noisy sensor priors, augments them with additional seeds, renders synthetic views from a textured aerial mesh, and iteratively refines the pose by turning 2D-2D matches into 2D-3D correspondences solved with PnP-LO-RANSAC. A new AirLoc dataset provides textured 3D aerial maps, ground-truth 6-DoF poses, and Day-Night variations to benchmark cross-view localization. The method significantly outperforms state-of-the-art baselines on AirLoc, demonstrating the effectiveness of render-aligned matching over traditional real-to-real matching for large viewpoint changes, with ablations confirming the benefits of multiple iterations and seed augmentation for noisy priors.

Abstract

Despite the significant progress in 6-DoF visual localization, researchers are mostly driven by ground-level benchmarks. Compared with aerial oblique photography, ground-level map collection lacks scalability and complete coverage. In this work, we propose to go beyond the traditional ground-level setting and exploit the cross-view localization from aerial to ground. We solve this problem by formulating camera pose estimation as an iterative render-and-compare pipeline and enhancing the robustness through augmenting seeds from noisy initial priors. As no public dataset exists for the studied problem, we collect a new dataset that provides a variety of cross-view images from smartphones and drones and develop a semi-automatic system to acquire ground-truth poses for query images. We benchmark our method as well as several state-of-the-art baselines and demonstrate that our method outperforms other approaches by a large margin.
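
The control flow described above (augment the noisy prior with seeds, keep the seed that matches best, then refine iteratively) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the 6-tuple pose layout, the Gaussian perturbation, keeping the unperturbed prior among the seeds, and the `count_inliers` and `refine_once` callbacks, which stand in for the real render, match, and PnP steps. The toy callbacks at the bottom exist only so the sketch runs end to end.

```python
import math
import random
from typing import Callable, List, Tuple

# Hypothetical pose layout: (x, y, z, yaw, pitch, roll).
Pose = Tuple[float, ...]


def augment_seeds(prior: Pose, k: int, trans_sigma: float, rot_sigma: float,
                  rng: random.Random) -> List[Pose]:
    """Return the prior followed by k randomly perturbed copies of it."""
    sigmas = (trans_sigma,) * 3 + (rot_sigma,) * 3
    seeds = [prior]  # assumption: the unperturbed prior stays a candidate
    for _ in range(k):
        seeds.append(tuple(p + rng.gauss(0.0, s) for p, s in zip(prior, sigmas)))
    return seeds


def localize(prior: Pose,
             count_inliers: Callable[[Pose], float],
             refine_once: Callable[[Pose], Pose],
             k: int = 8, iterations: int = 3,
             trans_sigma: float = 5.0, rot_sigma: float = 5.0,
             seed: int = 0) -> Pose:
    """Pick the best-scoring seed, then apply a fixed number of refinement steps."""
    rng = random.Random(seed)
    seeds = augment_seeds(prior, k, trans_sigma, rot_sigma, rng)
    # In a real system, count_inliers would render at the seed pose and match
    # the rendering against the query image.
    pose = max(seeds, key=count_inliers)
    for _ in range(iterations):
        # In a real system, refine_once would render, match, and re-estimate the pose.
        pose = refine_once(pose)
    return pose


# Toy stand-ins (no renderer or matcher): the score grows as the pose nears TARGET,
# and each refinement step moves the pose halfway toward TARGET.
TARGET: Pose = (10.0, -4.0, 1.5, 30.0, 0.0, 0.0)


def pose_error(pose: Pose) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(pose, TARGET)))


def toy_count_inliers(pose: Pose) -> float:
    return 1.0 / (1.0 + pose_error(pose))


def toy_refine_once(pose: Pose) -> Pose:
    return tuple(p + 0.5 * (t - p) for p, t in zip(pose, TARGET))


prior: Pose = (14.0, -9.0, 3.0, 41.0, 4.0, -3.0)
estimate = localize(prior, toy_count_inliers, toy_refine_once)
```

Passing the scoring and refinement steps as callbacks keeps the loop independent of any particular renderer or feature matcher, which is convenient for testing the selection logic in isolation.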

Paper Structure (29 sections, 4 equations, 12 figures, 4 tables)

Figures (12)

  • Figure 1: Cross-view 6-DoF localization. The proposed benchmark dataset AirLoc exhibits drastic view changes between query and reference. The reference map is captured using a pentacular oblique camera above $100$ meters, while the query images are sampled close to the ground with small drones and smartphones, respectively. Query images include real Day-and-Night environments.
  • Figure 2: Overview of the proposed method. (a) For each prior pose ${}_s\boldsymbol{\mathcal{\xi}}^q$, we first mitigate its noise by adding several random seeds $\{{}_s\boldsymbol{\mathcal{\xi}}^q_1, \ldots, {}_s\boldsymbol{\mathcal{\xi}}^q_k\}$, then select the seed with the most inlier matches as $\boldsymbol{\mathcal{\xi}}^q_{t_1}$. The virtual pose undergoes a Render-and-Compare update process toward the GT target ${}^*\boldsymbol{\mathcal{\xi}}^q$, passing through the intermediate pose $\boldsymbol{\mathcal{\xi}}^q_{t_i}$ at step $i$ and reaching the final pose $\boldsymbol{\mathcal{\xi}}^q_{t_h}$ at step $h$. (b, c, d) Feature correspondences between the query and the rendered image during iterative refinement, where warmer colors indicate higher confidence. Matching quality improves markedly over the successive adjustments.
  • Figure 3: Rendering result visualization. An example rendering: (a) shows the synthesized view, while (b) shows the corresponding depth map.
  • Figure 4: Alignment quality of the aerial-to-ground reconstruction on AirLoc. The dark black model comes from aerial oblique photography, while the yellow model is built from a sequence of ground-level cellphone photos. The accuracy of the alignment is visible in, for instance, the agreement of corners and edges.
  • Figure 5: GT pose quality on AirLoc. Pixel-aligned renderings at the estimated camera poses confirm that the poses are sufficiently accurate for our evaluation.
  • ...and 7 more figures
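
As a rough illustration of how a rendered depth map such as the one in Figure 3 could turn 2D-2D matches into 2D-3D correspondences, the sketch below back-projects each matched pixel of the rendered image into the world frame. It assumes a pinhole camera, depth measured along the optical axis, and a camera-to-world pose `(R_wc, t_wc)` for the rendering; these conventions and all function names are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Pixel = Tuple[float, float]


def backproject(u: float, v: float, depth: float,
                fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Pixel (u, v) with z-depth -> 3D point in the rendering camera frame."""
    return ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)


def camera_to_world(p_cam: Vec3, R_wc: Sequence[Sequence[float]],
                    t_wc: Sequence[float]) -> Vec3:
    """Apply X_world = R_wc * X_cam + t_wc."""
    return tuple(
        sum(R_wc[i][j] * p_cam[j] for j in range(3)) + t_wc[i] for i in range(3)
    )


def lift_matches(matches: Sequence[Tuple[Pixel, Pixel]],
                 depth_map: Sequence[Sequence[float]],
                 intrinsics: Tuple[float, float, float, float],
                 R_wc: Sequence[Sequence[float]],
                 t_wc: Sequence[float]) -> List[Tuple[Pixel, Vec3]]:
    """Convert (query pixel, rendered pixel) pairs into (query pixel, world point) pairs.

    Rendered pixels that fall outside the depth map or have no valid depth
    (depth <= 0, e.g. sky) are skipped.
    """
    fx, fy, cx, cy = intrinsics
    lifted = []
    for query_px, (ur, vr) in matches:
        row, col = int(round(vr)), int(round(ur))
        if not (0 <= row < len(depth_map) and 0 <= col < len(depth_map[0])):
            continue
        depth = depth_map[row][col]
        if depth <= 0:
            continue
        p_cam = backproject(ur, vr, depth, fx, fy, cx, cy)
        lifted.append((query_px, camera_to_world(p_cam, R_wc, t_wc)))
    return lifted


# Tiny example: a 4x4 depth map at 2 m with one invalid pixel at row 0, column 0.
depth_map = [[2.0] * 4 for _ in range(4)]
depth_map[0][0] = 0.0
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pairs = lift_matches(
    matches=[((50.0, 60.0), (3.0, 1.0)), ((70.0, 80.0), (0.0, 0.0))],
    depth_map=depth_map,
    intrinsics=(2.0, 2.0, 1.0, 1.0),
    R_wc=identity,
    t_wc=(10.0, 20.0, 30.0),
)
```

The resulting query-pixel and world-point pairs are the kind of input a PnP solver inside a RANSAC loop expects; the source names PnP-LO-RANSAC for that step.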