LoD-Loc v3: Generalized Aerial Localization in Dense Cities using Instance Silhouette Alignment

Shuaibang Peng, Juelin Zhu, Xia Li, Kun Yang, Maojun Zhang, Yu Liu, Shen Yan

Abstract

We present LoD-Loc v3, a novel method for generalized aerial visual localization in dense urban environments. The prior work LoD-Loc v2 localizes by aligning semantic building silhouettes with low-detail city models, but it suffers from two key limitations: poor cross-scene generalization and frequent failure in dense building scenes. Our method addresses these challenges through two key innovations. First, we develop a new synthetic data generation pipeline that produces InsLoD-Loc, the largest instance segmentation dataset for aerial imagery to date, comprising 100k images with precise building instance annotations. Models trained on this dataset exhibit remarkable zero-shot generalization capability. Second, we reformulate the localization paradigm by shifting from semantic to instance silhouette alignment, which significantly reduces pose estimation ambiguity in dense scenes. Extensive experiments demonstrate that LoD-Loc v3 outperforms existing state-of-the-art (SOTA) baselines by a large margin in both cross-scene and dense urban scenarios. The project is available at https://nudt-sawlab.github.io/LoD-Locv3/.

Paper Structure (14 sections, 4 equations, 4 figures, 6 tables)

Figures (4)

  • Figure 1: In this paper, (a) we introduce LoD-Loc v3 to address two critical challenges in aerial localization over LoD city models: cross-scene generalization and the ambiguity problem in dense urban scenes. Our solutions are twofold: (b) we construct InsLoD-Loc, a large-scale synthetic dataset covering 40 distinct areas, for training models that generalize zero-shot, and (c) we reformulate the localization paradigm by shifting from semantic to instance silhouette alignment, which provides superior convergence.
  • Figure 2: Visualization of instanced LoD models and labels. The top row displays the instanced LoD models. The bottom row illustrates the instance labels rendered from the models, alongside their corresponding semantic labels.
  • Figure 3: Overview of the InsLoD-Loc dataset. The left panel illustrates the geographic distribution of the 40 flight areas across Europe and Asia. The right panel showcases representative samples from the dataset, each displaying (from left to right): a photorealistic RGB query image and its corresponding pixel-accurate instance label, where each color represents a unique building.
  • Figure 4: Visualization of localization results on the Tokyo-LoDv3 dataset. Superimposed instance masks rendered at estimated poses demonstrate that our method effectively resolves ambiguity in dense urban scenes. The columns from left to right show: query image, prior pose, LoD-Loc v2, LoD-Loc v3 (Ours), and ground truth.
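
The ambiguity that instance silhouette alignment is said to reduce in dense scenes can be illustrated with a toy example. The sketch below is not the paper's cost function. It assumes a single image scanline with four touching buildings, and it scores a pose either by the IoU of the merged building mask (semantic) or by the mean best-match IoU per rendered building (instance). The function names, building widths, and the best-match rule are all hypothetical choices made for illustration.

```python
# Toy illustration only: every name and number here is an assumption,
# not taken from LoD-Loc v3.

def make_scanline(shift=0):
    """Return {building_id: set of pixel indices} for four touching buildings."""
    widths = [3, 5, 4, 6]  # hypothetical building widths in pixels
    instances, start = {}, 2 + shift
    for building_id, width in enumerate(widths, start=1):
        instances[building_id] = set(range(start, start + width))
        start += width
    return instances

def iou(a, b):
    """Intersection over union of two pixel sets."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def semantic_iou(query, rendered):
    """IoU after merging all buildings into one foreground mask."""
    return iou(set().union(*query.values()), set().union(*rendered.values()))

def instance_score(query, rendered):
    """Mean over rendered buildings of the best IoU with any query building."""
    best = [max(iou(r, q) for q in query.values()) for r in rendered.values()]
    return sum(best) / len(best)

query = make_scanline()            # silhouettes segmented from the query image
rendered = make_scanline(shift=2)  # silhouettes rendered at a pose off by 2 px

print("semantic IoU:  ", round(semantic_iou(query, rendered), 3))
print("instance score:", round(instance_score(query, rendered), 3))
```

In this toy setup the 2-pixel error leaves the merged mask largely overlapping (semantic IoU of 0.8), because touching buildings form a single blob whose interior carries no signal. The instance score falls to roughly 0.4, since every internal building boundary is misaligned. Both scores equal 1.0 at the correct pose.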