DiffusionUavLoc: Visually Prompted Diffusion for Cross-View UAV Localization

Tao Liu, Kan Ren, Qian Chen

TL;DR

This work tackles cross-view UAV localization in GNSS-denied environments by introducing DiffusionUavLoc, a text-free, diffusion-based framework that uses orthophotos rendered by a training-free geometric pipeline as visual prompts and a VAE latent space for retrieval. It fuses multimodal structural priors (edges, semantics, depth) through ControlNet to condition a diffusion model, and learns unified UAV-satellite descriptors without iterative denoising. A multi-objective, uncertainty-weighted loss is designed to preserve sharp textures and faithful structure while aligning cross-view geometry, yielding state-of-the-art satellite-to-drone performance on University-1652 and robust results across altitude variations on SUES-200. The approach is practical: it avoids reliance on text prompts and enables fast descriptor-based retrieval, and heatmap visualizations show geometry-aligned activations. This has meaningful implications for reliable UAV localization in environments where GNSS is compromised, and the approach may generalize to other cross-view localization tasks.

Abstract

With the rapid growth of the low-altitude economy, unmanned aerial vehicles (UAVs) have become key platforms for measurement and tracking in intelligent patrol systems. However, in GNSS-denied environments, localization schemes that rely solely on satellite signals are prone to failure. Cross-view image retrieval-based localization is a promising alternative, yet substantial geometric and appearance domain gaps exist between oblique UAV views and nadir satellite orthophotos. Moreover, conventional approaches often depend on complex network architectures, text prompts, or large amounts of annotation, which hinders generalization. To address these issues, we propose DiffusionUavLoc, a cross-view localization framework that is image-prompted, text-free, diffusion-centric, and employs a VAE for unified representation. We first use training-free geometric rendering to synthesize pseudo-satellite images from UAV imagery as structural prompts. We then design a text-free conditional diffusion model that fuses multimodal structural cues to learn features robust to viewpoint changes. At inference, descriptors are computed at a fixed time step t and compared using cosine similarity. On University-1652 and SUES-200, the method performs competitively for cross-view localization, especially for satellite-to-drone retrieval on University-1652. Our data and code will be published at the following URL: https://github.com/liutao23/DiffusionUavLoc.git.
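
The retrieval step described in the abstract, comparing descriptors by cosine similarity, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`l2_normalize`, `cosine_similarity`, `rank_gallery`) and the toy 3-D descriptors are hypothetical, and the sketch assumes descriptors have already been extracted at the fixed time step t by some encoder that is not shown here.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a descriptor to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two descriptors of equal length."""
    if len(a) != len(b):
        raise ValueError("descriptors must have the same dimension")
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def rank_gallery(query: Sequence[float], gallery: Sequence[Sequence[float]]) -> List[int]:
    """Return gallery indices sorted from most to least similar to the query."""
    scores = [cosine_similarity(query, g) for g in gallery]
    return sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)


# Toy 3-D descriptors (hypothetical); real descriptors would be far higher-dimensional.
query = [1.0, 0.0, 1.0]
gallery = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 2.0],  # same direction as the query, different magnitude
    [1.0, 1.0, 0.0],  # partially aligned with the query
]
ranking = rank_gallery(query, gallery)
print(ranking)
```

Because cosine similarity ignores descriptor magnitude, the second gallery entry ranks first even though its norm differs from the query's. For satellite-to-drone retrieval, the query would be a satellite descriptor and the gallery a set of drone descriptors; the roles swap for drone-to-satellite.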

Paper Structure

This paper contains 33 sections, 19 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: Motivation and overview of DiffusionUavLoc. (a) Main challenges. Cross-view UAV–satellite matching suffers from large gaps in imaging geometry and appearance: oblique, low-altitude UAV views versus near-nadir satellite orthophotos, scale and occlusion changes, adverse conditions (fog, darkness, snow), and the presence or absence of moving targets. (b) Bottlenecks of existing methods. These factors cause out-of-domain degradation: matching succeeds on homologous urban pairs but often fails on heterogeneous natural scenes; representative labeled drone–satellite pairs are illustrated. GAN-based translation is unstable to train, and standard diffusion pipelines rely on text prompts, which provide weak geometric guidance and yield blurry or misaligned cross-view correspondences across timesteps. (c) Our approach. We render high-resolution orthophotos from oblique UAV imagery via training-free geometric orthorectification and projection refinement to produce pseudo-satellite image prompts. Multimodal structural cues (e.g., Canny edges, SAM masks, and DepthAnything-v2 depth) are fused and injected through ControlNet into a text-free conditional diffusion decoder with a VAE encoder for unified representation, producing virtual-satellite views supervised by real satellite targets. (d) Comparison. On University-1652, the proposed method occupies the top-right region of the accuracy plot, indicating competitive, often superior, cross-view retrieval performance on the satellite-to-drone task.
  • Figure 2: Outcomes of traditional orthophoto rendering, from left to right: the UAV viewpoint, a failed COLMAP reconstruction, the consequences of perspective transformation and content loss, and the satellite image.
  • Figure 3: Qualitative response heatmaps on University-1652. For multiple drone viewpoints and the corresponding satellite reference, our method produces compact and spatially aligned activations that concentrate on geometry-consistent regions such as building footprints, long straight edges, and stable man-made structures. The hotspots remain co-located across viewpoints and largely overlap with the satellite reference, indicating robust cross-view correspondence and reduced spurious responses. Overall, the heatmaps show tighter localization and better alignment between drone and satellite views after applying our approach.