Unsupervised Multi-view UAV Image Geo-localization via Iterative Rendering

Haoyuan Li, Chang Xu, Wen Yang, Li Mi, Huai Yu, Haijian Zhang

TL;DR

This work proposes an unsupervised solution that lifts the scene representation from UAV observations into 3D space to generate satellite-like images, providing a representation that is robust to view distortion and enabling generic CVGL for UAV images without feature fine-tuning or data-driven training.

Abstract

Unmanned Aerial Vehicle (UAV) Cross-View Geo-Localization (CVGL) presents significant challenges due to the view discrepancy between oblique UAV images and overhead satellite images. Existing methods rely heavily on supervision from labeled datasets to extract viewpoint-invariant features for cross-view retrieval. However, these methods incur high training costs and tend to overfit to region-specific cues, showing limited generalizability to new regions. To overcome this issue, we propose an unsupervised solution that lifts the scene representation from UAV observations into 3D space for satellite image generation, providing a representation that is robust to view distortion. By generating orthogonal images that closely resemble satellite views, our method reduces view discrepancies in feature representation and mitigates shortcuts in region-specific image pairing. To further align the rendered image's perspective with the real one, we design an iterative camera pose updating mechanism that progressively modulates the rendered query image with potential satellite targets, eliminating spatial offsets relative to the reference images. Additionally, this iterative refinement strategy enhances cross-view feature invariance through view-consistent fusion across iterations. As such, our unsupervised paradigm naturally avoids the problem of region-specific overfitting, enabling generic CVGL for UAV images without feature fine-tuning or data-driven training. Experiments on the University-1652 and SUES-200 datasets demonstrate that our approach significantly improves geo-localization accuracy while maintaining robustness across diverse regions. Notably, without model fine-tuning or paired training, our method achieves performance competitive with recent supervised methods.
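The retrieval step that the abstract refers to can be illustrated with a minimal sketch. It assumes that the rendered satellite-like view and every satellite gallery image have already been encoded into global feature vectors by some frozen feature extractor, and that matching uses cosine similarity. Both assumptions, along with the function names `l2_normalize` and `retrieve_top_k`, are illustrative and are not taken from the paper.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarity."""
    norms = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norms, eps)


def retrieve_top_k(query_feat, gallery_feats, k=5):
    """Rank gallery features by cosine similarity to a single query feature.

    query_feat:    (D,) feature of the rendered view (hypothetical input).
    gallery_feats: (N, D) features of the satellite gallery (hypothetical input).
    Returns the indices of the k best matches and their similarity scores.
    """
    q = l2_normalize(np.asarray(query_feat, dtype=np.float64)[None, :])
    g = l2_normalize(np.asarray(gallery_feats, dtype=np.float64))
    sims = (g @ q.T).ravel()          # one cosine similarity per gallery image
    order = np.argsort(-sims)[:k]     # highest similarity first
    return order, sims[order]


# Toy usage: three 2-D gallery features and one query feature.
gallery = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = np.array([0.9, 0.1])
top_idx, top_sims = retrieve_top_k(query, gallery, k=3)
```

In an iterative scheme such as the one described above, the top-ranked candidates from this step would serve as the "potential satellite targets" against which the rendered view is realigned before the next retrieval round.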

Paper Structure

This paper contains 23 sections, 10 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: Illustration of the proposed scene rendering approach for cross-view geo-localization. We propose a multi-view rendering image regression and retrieval approach that uses multiple UAV-captured views of a scene to predict its geo-location by retrieving matching satellite images from a database. First, we establish a 3D field representation of the query scene from multiple oblique images. Then, we iteratively render the virtual satellite view to align it with the real satellite image from the database (yellow box). For notational convenience, we use $\bm{T}$ to represent both the rotation $\bm{R}$ and the translation $\bm{t}$ of a camera pose.
  • Figure 2: Overview of the rendering-based UAV geo-localization. Given multiple oblique input views, our method first predicts an initial sparse reconstruction and learns a 3D representation using 3DGS. A virtual camera is estimated to render the scene and extract features for matching with real satellite images. The virtual camera pose is then iteratively updated based on feature matching, enabling high-fidelity novel view synthesis that aligns with the true satellite image.
  • Figure 3: Candidate camera pose selection. Black denotes the previous camera’s coordinate system. Inlier camera poses are marked in green, while outliers are marked in red. In the previous camera’s coordinate system, $\Delta d$ is the x-y distance to the previous camera, and $\Delta \theta$ is the angular deviation relative to the previous camera’s z-axis.
  • Figure 4: The update of the rendered images. After extracting the global features of the rendered candidates, we refine the scene’s features using a view consistency fusion module. The module first computes the self-view consistency $\alpha$ between the rendered candidates and the previously rendered view and then calculates the cross-view consistency $\beta$ between the rendered candidates and their corresponding satellite images. The symbol $\sum$ denotes the feature fusion in Eq. \ref{eq:fusion}.
  • Figure 5: Illustration of rendered images and the corresponding true satellite images. The first two rows show samples of the input drone images. The third row shows the rendered images of the query scene. The rendered images align with the true satellite image targets (green boxes).
  • ...and 6 more figures
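
The inlier test described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: poses are taken to be 4x4 camera-to-world matrices, $\Delta \theta$ is measured as the angle between the candidate's z-axis and the previous camera's z-axis, and the thresholds `d_max` and `theta_max` as well as the function names are hypothetical.

```python
import numpy as np


def pose_deviation(T_prev, T_cand):
    """Return (delta_d, delta_theta) of a candidate pose in the previous camera frame.

    Assumes 4x4 camera-to-world matrices (an assumption, not stated in the source).
    """
    # Express the candidate pose in the previous camera's coordinate system.
    T_rel = np.linalg.inv(T_prev) @ T_cand
    # x-y distance between the candidate camera centre and the previous camera.
    delta_d = float(np.linalg.norm(T_rel[:2, 3]))
    # Angle between the candidate z-axis and the previous camera's z-axis.
    cos_theta = np.clip(T_rel[2, 2], -1.0, 1.0)
    delta_theta = float(np.arccos(cos_theta))
    return delta_d, delta_theta


def select_inliers(T_prev, candidates, d_max, theta_max):
    """Keep indices of candidates whose deviation stays within both thresholds."""
    keep = []
    for i, T in enumerate(candidates):
        dd, dt = pose_deviation(T_prev, T)
        if dd <= d_max and dt <= theta_max:
            keep.append(i)
    return keep


def translation(x, y, z):
    """Build a pure-translation pose (helper for the toy example)."""
    T = np.eye(4)
    T[:3, 3] = [x, y, z]
    return T


def rotation_x(angle_rad):
    """Build a pose rotated about the x-axis (helper for the toy example)."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    T = np.eye(4)
    T[1:3, 1:3] = [[c, -s], [s, c]]
    return T


# Toy usage: one nearby candidate, one far away, one tilted by 60 degrees.
T_prev = np.eye(4)
candidates = [translation(0.5, 0.0, 0.0), translation(3.0, 0.0, 0.0), rotation_x(np.radians(60))]
inliers = select_inliers(T_prev, candidates, d_max=1.0, theta_max=np.radians(30))
```

With these toy thresholds only the first candidate is kept: the second exceeds the distance limit and the third exceeds the angular limit, mirroring the green and red poses in the figure.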