Beyond Paired Data: Self-Supervised UAV Geo-Localization from Reference Imagery Alone

Tristan Amadei, Enric Meinhardt-Llopis, Benedicte Bascle, Corentin Abgrall, Gabriele Facciolo

TL;DR

The paper tackles GNSS-denied UAV localization by eliminating the need for paired UAV-reference training data. It introduces CAEVL, a lightweight, edge-based autoencoder that uses perceptual loss and a non-contrastive VICRegL fine-tuning stage to learn domain-invariant embeddings from satellite-reference imagery alone, along with a challenging high-altitude UAV benchmark, ViLD. ViLD comprises real UAV flights and large sets of satellite-derived reference crops, capturing vignetting and non-nadir views up to 1600 m altitude, and is released to the community. Results show CAEVL achieves competitive localization accuracy with far lower computational cost compared to fully supervised methods, demonstrating strong generalization and robustness to cross-view shifts; the work also provides extensive ablations and robustness analyses to validate the approach.

Abstract

Image-based localization in GNSS-denied environments is critical for UAV autonomy. Existing state-of-the-art approaches rely on matching UAV images to geo-referenced satellite images; however, they typically require large-scale, paired UAV-satellite datasets for training. Such data are costly to acquire and often unavailable, limiting their applicability. To address this challenge, we adopt a training paradigm that removes the need for UAV imagery during training by learning directly from satellite-view reference images. This is achieved through a dedicated augmentation strategy that simulates the visual domain shift between satellite and real-world UAV views. We introduce CAEVL, an efficient model designed to exploit this paradigm, and validate it on ViLD, a new and challenging dataset of real-world UAV images that we release to the community. Our method achieves competitive performance compared to approaches trained with paired data, demonstrating its effectiveness and strong generalization capabilities.

Paper Structure

This paper contains 28 sections, 4 equations, 31 figures, 6 tables.

Figures (31)

  • Figure 1: Recall@1 at 100m and 150m for CAEVL and other SOTA methods on the ViLD dataset. CAEVL achieves high accuracy while requiring fewer GFLOPs per query and without using labeled paired data during training. A more comprehensive set of results can be found in Table \ref{tab:recall_comparison}.
  • Figure 2: Step 1: define a regular tiling (blue points) that covers the entire flight trajectory. Step 2: define a search zone around the flight trajectory and select the points from the regular tiling inside the search zone (red points). Figures on the lower row show zooms of the upper images, to display closer details. For each geographic point previously selected, we extract a reference image centered around it and rotate it in the direction of the heading of the drone.
  • Figure 3: Examples of randomly picked UAV images (upper row) and their geographically closest reference images (lower row). These images are extracted from our proposed dataset ViLD.
  • Figure 4: Overview of CAEVL. All input images are first processed through a Canny filter to extract the edges. An autoencoder is trained using a pixel-wise $L_2$ loss and a perceptual loss. The decoder is then discarded and the encoder is fine-tuned using a non-contrastive approach. Two views of the same input image are passed through the encoder to produce both local features (feature maps before pooling) and global features (embeddings after pooling). The local features are fed to a local projection head that projects them to a smaller space. Two sets of matches are computed: one using the spatial information from each view, and the other based on the $L_2$-distance in the embedding space. The VICReg criterion is then applied to these matched spatial embeddings. Furthermore, the global features are passed into a global projection head to produce global embeddings, and the VICReg criterion is applied to these global embeddings as well. After training, the local and global projection heads are discarded. The encoder is kept to compute embeddings of UAV and reference images, which are then compared using cosine similarity.
  • Figure 5: Evolution of Recall@1 across different distance thresholds (100m, 150m, 250m, 500m) for all evaluated methods, using fine-tuned models when applicable. Higher values indicate better localization performance.
  • ...and 26 more figures
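
The captions of Figures 2, 4 and 5 together describe a retrieval pipeline: pick reference points on a regular tiling inside a search zone around the flight, compare embeddings with cosine similarity, and score Recall@1 at a distance threshold. The following is a minimal sketch of those three steps, not the authors' implementation. All function names are hypothetical. It assumes planar coordinates in metres, approximates the search zone as "within `radius` of any trajectory point", and takes Recall@1 to mean the fraction of queries whose top-1 retrieved reference lies within the threshold of the query's true position. The rotation of reference images to the drone heading and the encoder itself are omitted.

```python
import math


def tile_search_zone(trajectory, spacing, radius):
    """Regular tiling over the flight, kept only inside the search zone.

    Assumption: the search zone is the set of points within `radius`
    metres of at least one trajectory point (planar x/y coordinates).
    """
    xs = [p[0] for p in trajectory]
    ys = [p[1] for p in trajectory]
    x0, x1 = min(xs) - radius, max(xs) + radius
    y0, y1 = min(ys) - radius, max(ys) + radius
    nx = int(math.floor((x1 - x0) / spacing)) + 1
    ny = int(math.floor((y1 - y0) / spacing)) + 1
    # Step 1: regular grid covering the whole trajectory.
    grid = [(x0 + i * spacing, y0 + j * spacing)
            for i in range(nx) for j in range(ny)]
    # Step 2: keep only grid points inside the search zone.
    return [g for g in grid
            if any(math.dist(g, p) <= radius for p in trajectory)]


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_top1(query_emb, ref_embs):
    """Index of the reference embedding most similar to the query."""
    return max(range(len(ref_embs)),
               key=lambda i: cosine_similarity(query_emb, ref_embs[i]))


def recall_at_1(query_embs, query_pos, ref_embs, ref_pos, threshold_m):
    """Fraction of queries whose top-1 reference is within threshold_m."""
    hits = 0
    for emb, pos in zip(query_embs, query_pos):
        best = retrieve_top1(emb, ref_embs)
        if math.dist(ref_pos[best], pos) <= threshold_m:
            hits += 1
    return hits / len(query_embs)


# Toy data (made up for illustration, not taken from ViLD).
ref_pos = [(0.0, 0.0), (300.0, 0.0), (600.0, 0.0)]
ref_embs = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
query_pos = [(20.0, 10.0), (580.0, 0.0)]
query_embs = [[0.9, 0.1], [0.1, 0.9]]

for threshold in (100.0, 500.0):
    score = recall_at_1(query_embs, query_pos, ref_embs, ref_pos, threshold)
    print(f"Recall@1 at {threshold:.0f} m: {score:.2f}")
```

In the toy data the second query retrieves a reference 280 m from its true position, so it counts as a miss at 100 m and a hit at 500 m. This mirrors why Recall@1 rises with the distance threshold in Figure 5.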