Render-and-Compare: Cross-View 6 DoF Localization from Noisy Prior
Shen Yan, Xiaoya Cheng, Yuxiang Liu, Juelin Zhu, Rouwan Wu, Yu Liu, Maojun Zhang
TL;DR
This work tackles cross-view visual localization for 6-DoF pose estimation, using aerial oblique imagery as the reference for ground-level queries. It introduces a three-stage render-and-compare pipeline that starts from a noisy sensor prior, augments it into multiple seed poses, renders synthetic views from a textured aerial mesh, and iteratively refines the pose by lifting 2D-2D matches between the query and rendered views to 2D-3D correspondences and solving with PnP and LO-RANSAC. A new dataset, AirLoc, provides textured 3D aerial maps, ground-truth 6-DoF poses, and day-night variations for benchmarking cross-view localization. The method significantly outperforms state-of-the-art baselines on AirLoc, indicating that matching against rendered views is more effective than traditional real-to-real matching under large viewpoint changes. Ablations confirm that multiple iterations and seed augmentation both help when the prior is noisy.
Abstract
Despite significant progress in 6-DoF visual localization, research is mostly driven by ground-level benchmarks. Compared with aerial oblique photography, ground-level map collection lacks scalability and complete coverage. In this work, we propose to go beyond the traditional ground-level setting and explore cross-view localization from aerial to ground. We solve this problem by formulating camera pose estimation as an iterative render-and-compare pipeline and by improving robustness through augmenting seeds from noisy initial priors. As no public dataset exists for the studied problem, we collect a new dataset that provides a variety of cross-view images from smartphones and drones, and we develop a semi-automatic system to acquire ground-truth poses for query images. We benchmark our method as well as several state-of-the-art baselines and demonstrate that our method outperforms other approaches by a large margin.
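As an illustration of what augmenting seeds from a noisy initial prior can look like, the following is a minimal sketch and not the authors' implementation. It assumes the prior is perturbed on a small grid of horizontal offsets and yaw offsets, and that each seed is later scored by some matching criterion. The `Pose` type, the offset values, and the `score` callable are all hypothetical.

```python
import itertools
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Pose:
    """Hypothetical 6-DoF pose: position in metres, angles in radians."""
    x: float
    y: float
    z: float
    yaw: float
    pitch: float
    roll: float


def augment_seeds(prior, xy_offsets=(-5.0, 0.0, 5.0),
                  yaw_offsets_deg=(-30.0, 0.0, 30.0)):
    """Return seed poses on a grid around the prior (assumed scheme).

    Only x, y and yaw are perturbed; z, pitch and roll are copied from
    the prior. The offset values are illustrative, not from the paper.
    """
    seeds = []
    for dx, dy, dyaw in itertools.product(xy_offsets, xy_offsets,
                                          yaw_offsets_deg):
        # Wrap the perturbed yaw into [-pi, pi).
        yaw = (prior.yaw + math.radians(dyaw) + math.pi) % (2 * math.pi) - math.pi
        seeds.append(Pose(prior.x + dx, prior.y + dy, prior.z,
                          yaw, prior.pitch, prior.roll))
    return seeds


def select_best(seeds, score):
    """Pick the seed with the highest score.

    `score` is a placeholder for whatever criterion compares the query
    with the view rendered at a seed, e.g. a count of inlier matches.
    """
    return max(seeds, key=score)
```

In a render-and-compare loop, each seed would be rendered on the aerial mesh and matched against the query image, and the best-scoring seed would initialize the iterative refinement. A grid is used here only because it is simple to reason about; random sampling around the prior would fit the same interface.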
