Satellite-Free Training for Drone-View Geo-Localization

Tao Liu, Yingzhi Zhang, Kan Ren, Xiaoqi Zhao

Abstract

Drone-view geo-localization (DVGL) aims to determine a drone's location in GPS-denied environments: given UAV observations of a location, the corresponding geotagged satellite tile is retrieved from a reference gallery. In many existing formulations, these observations are a single oblique UAV image. In contrast, our satellite-free setting is designed for multi-view UAV sequences, which are used to construct a geometry-normalized UAV-side location representation before cross-view retrieval. Existing approaches rely on satellite imagery during training, either through paired supervision or unsupervised alignment, which limits practical deployment when satellite data are unavailable or restricted. In this paper, we propose a satellite-free training (SFT) framework that converts drone imagery into cross-view-compatible representations through three main stages: drone-side 3D scene reconstruction, geometry-based pseudo-orthophoto generation, and satellite-free feature aggregation for retrieval. Specifically, we first reconstruct dense 3D scenes from multi-view drone images using 3D Gaussian splatting and project the reconstructed geometry into pseudo-orthophotos via PCA-guided orthographic projection; this rendering stage operates directly on the reconstructed scene geometry and requires no camera parameters at rendering time. Next, we refine these orthophotos with lightweight geometry-guided inpainting to obtain texture-complete drone-side views. Finally, we extract DINOv3 patch features from the generated orthophotos, learn a Fisher vector aggregation model solely from drone data, and reuse it at test time to encode satellite tiles for cross-view retrieval. Experiments on University-1652 and SUES-200 show that our SFT framework substantially outperforms satellite-free generalization baselines and narrows the gap to methods trained with satellite imagery.
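
To make the projection stage concrete, the sketch below shows one plausible realization of the PCA-guided orthographic projection on a colored point cloud densified from the 3D Gaussian field. It is an illustrative simplification rather than the paper's exact implementation: the function name, the raster resolution, and the reduction of soft-roof compositing to a hard top-surface painter's pass are all assumptions.

```python
import numpy as np

def pseudo_orthophoto(points, colors, resolution=512):
    """Project a dense point cloud into a top-down pseudo-orthophoto.

    points : (N, 3) float array of reconstructed 3D positions
    colors : (N, 3) float array of per-point RGB values
    (names and resolution are illustrative, not from the paper)
    """
    # 1. PCA on the point positions: for roughly planar urban scenes, the
    #    eigenvector with the smallest eigenvalue approximates the
    #    ground-plane normal; the other two span the ground plane.
    centroid = points.mean(axis=0)
    centered = points - centroid
    eigvals, eigvecs = np.linalg.eigh(np.cov(centered.T))  # ascending order
    normal = eigvecs[:, 0]                                 # plane normal
    u_axis, v_axis = eigvecs[:, 2], eigvecs[:, 1]          # in-plane axes

    # 2. Orthographic coordinates: in-plane position plus height above plane.
    u = centered @ u_axis
    v = centered @ v_axis
    h = centered @ normal

    # 3. Rasterize onto a square grid covering the in-plane extent.
    span = max(u.max() - u.min(), v.max() - v.min())
    cols = ((u - u.min()) / span * (resolution - 1)).astype(int)
    rows = ((v - v.min()) / span * (resolution - 1)).astype(int)

    # 4. Painter's pass: draw low points first so the highest point in each
    #    cell ends up on top. This hard top-surface stands in for the
    #    paper's soft-roof compositing, which instead blends a roof layer
    #    weighted by a height-adaptive Gaussian kernel.
    order = np.argsort(h)
    image = np.zeros((resolution, resolution, 3), dtype=colors.dtype)
    filled = np.zeros((resolution, resolution), dtype=bool)
    image[rows[order], cols[order]] = colors[order]
    filled[rows[order], cols[order]] = True
    return image, ~filled   # orthophoto and hole mask
```

The returned hole mask marks empty raster cells and is the kind of input the subsequent geometry-guided inpainting stage is meant to fill.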

Figures (8)

  • Figure 1: Illustration of converting a 3D Gaussian field into a dense point cloud.
  • Figure 2: Pipeline of PCA-based ground plane projection and soft-roof rendering. Top left shows a dense point cloud and estimated ground plane. Top right depicts the projection onto a raster grid. Bottom left illustrates two-layer compositing with the roof layer weighted by a height-adaptive Gaussian kernel. Bottom right displays the resulting UAV pseudo-orthophoto.
  • Figure 3: Qualitative Drone→Satellite retrieval results. Each row corresponds to one query location. From left to right: (i) several representative raw UAV views from the multi-view query sequence, shown for visualization only; (ii) the pseudo-orthophoto reconstructed from the full UAV sequence and used as the actual query representation in our method; and (iii) the top-5 retrieved satellite images, where the correct match is highlighted in green and incorrect candidates are framed in red.
  • Figure 4: Qualitative Satellite→Drone retrieval results. Each row shows a satellite query on the left and the top-5 retrieved drone pseudo-orthophotos on the right, with correct matches highlighted in green and incorrect ones framed in red.
  • Figure 5: Illustration of the proposed geometry-guided inpainting pipeline. From left to right, three oblique drone views are used to reconstruct a dense point cloud, which is then projected into a pseudo-orthophoto together with its corresponding hole mask. We compare a purely classical inpainting strategy (Telea + KNN) with our geometry-aware LaMa completion (Geom-LaMa), which better preserves roof contours, road boundaries, and large background regions while effectively removing artifacts near building edges (a minimal sketch of the classical baseline follows this list).
  • ...and 3 more figures
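
For the classical baseline named in the Figure 5 caption, a minimal sketch of Telea inpainting via OpenCV is given below; the KNN color-filling step and the geometry-aware Geom-LaMa completion are omitted. The function name and default radius are hypothetical, and the inputs are assumed to be the uint8 orthophoto and boolean hole mask produced by the projection step.

```python
import cv2
import numpy as np

def telea_baseline(orthophoto, hole_mask, radius=5):
    """Fill empty raster cells with classical Telea inpainting.

    orthophoto : (H, W, 3) uint8 pseudo-orthophoto
    hole_mask  : (H, W) bool array marking unfilled cells
    """
    # OpenCV expects an 8-bit single-channel mask; nonzero pixels are inpainted.
    mask = hole_mask.astype(np.uint8) * 255
    return cv2.inpaint(orthophoto, mask, radius, cv2.INPAINT_TELEA)
```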