A Semantic Segmentation-guided Approach for Ground-to-Aerial Image Matching

Francesco Pro, Nikolaos Dionelis, Luca Maiano, Bertrand Le Saux, Irene Amerini

TL;DR

This work tackles GPS-free geo-localization by framing ground-to-aerial image matching as a cross-view retrieval task. It introduces Semantic Align Net (SAN), a three-branch Siamese-like network that fuses features from ground-view images, satellite-view images, and the satellite images' semantic segmentation masks (via NEOS), after a polar transformation that aligns the aerial domain with the ground domain. SAN uses feature concatenation and correlation to estimate similarity and orientation, and it outperforms the baselines across multiple FoVs, covering both limited-FoV and panoramic ground images. The approach improves robustness to viewpoint and content variations and is relevant to Earth Observation and related applications. The authors plan to scale to the full CVUSA dataset and to extend segmentation to ground images.
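The polar transformation mentioned above can be illustrated with a small sketch. The summary does not give the exact mapping, so the version below assumes one convention that is common in cross-view matching: output columns index azimuth (starting at north, increasing clockwise), output rows index distance from the image centre (top row farthest), and sampling is nearest-neighbour. The function name `polar_transform` and its parameters are illustrative, not taken from the paper.

```python
import numpy as np


def polar_transform(aerial, out_h, out_w):
    """Resample a square aerial image so that columns index azimuth
    and rows index radius (assumed convention, nearest-neighbour sampling)."""
    size = aerial.shape[0]
    if aerial.shape[1] != size:
        raise ValueError("expected a square aerial image")

    center = (size - 1) / 2.0
    rows = np.arange(out_h).reshape(-1, 1)   # radius index, shape (out_h, 1)
    cols = np.arange(out_w).reshape(1, -1)   # azimuth index, shape (1, out_w)

    # Top output row maps to the image border, bottom row approaches the centre.
    radius = center * (out_h - rows) / out_h
    theta = 2.0 * np.pi * cols / out_w       # 0 = north, increasing clockwise

    # Source pixel coordinates in the aerial image (x = column, y = row).
    src_x = center + radius * np.sin(theta)
    src_y = center - radius * np.cos(theta)

    xs = np.clip(np.rint(src_x).astype(int), 0, size - 1)
    ys = np.clip(np.rint(src_y).astype(int), 0, size - 1)
    return aerial[ys, xs]                    # works for (S, S) and (S, S, C)


# Usage: a toy 9x9 "image" whose pixel values encode their own position.
toy = np.arange(81).reshape(9, 9)
polar = polar_transform(toy, out_h=4, out_w=8)
print(polar.shape)
```

After this resampling, a horizontal shift of the polar image corresponds to a rotation of the camera heading. This is why a correlation along the width axis can be used to estimate orientation.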

Abstract

Accurate geo-localization of ground-view images plays an important role in domains as diverse as journalism, forensic analysis, transportation, and Earth Observation. This work addresses the problem of matching a query ground-view image with the corresponding satellite image without GPS data. It does so by comparing features from the ground-view image with features from the satellite image, additionally leveraging the satellite image's segmentation mask through a three-stream Siamese-like network. The proposed method, Semantic Align Net (SAN), focuses on limited Field-of-View (FoV) images and ground panorama images (images with a FoV of 360°). The novelty lies in fusing satellite images with their semantic segmentation masks, so that the model can extract useful features and focus on the significant parts of the images. This work shows how SAN, through semantic analysis of images, improves performance on the unlabelled CVUSA dataset for all the tested FoVs.

Paper Structure

This paper contains 5 sections, 3 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Example of the ground-to-aerial matching problem. The query ground-view image is matched to the polar-transformed aerial image.
  • Figure 2: Our SAN network comprises three VGG16 branches that extract features from (1) the ground-view image, (2) the satellite-view image, and (3) the satellite image's semantic segmentation mask. The features from the last two branches are correlated and compared with those from the ground image to estimate a similarity score.
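
The matching step described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. Random arrays stand in for the VGG16 branch outputs. The satellite and mask features are fused by channel concatenation. The ground features are assumed to have the same height and the same total channel count as the fused features. Orientation is taken as the circular column shift that maximises a raw (unnormalised) correlation. All function names are hypothetical.

```python
import numpy as np


def circular_crop(feat, shift, width):
    """Take `width` columns starting at `shift`, wrapping around the azimuth axis."""
    cols = (shift + np.arange(width)) % feat.shape[1]
    return feat[:, cols]


def estimate_orientation_and_similarity(sat_feat, mask_feat, grd_feat):
    """Return (best_shift, similarity) for one ground/aerial pair.

    sat_feat, mask_feat: (H, W_s, C_s) and (H, W_s, C_m) polar-aligned features.
    grd_feat:            (H, W_g, C_s + C_m) with W_g <= W_s (limited FoV).
    """
    # Assumed fusion: concatenate satellite and mask features along channels.
    fused = np.concatenate([sat_feat, mask_feat], axis=-1)
    if fused.shape[0] != grd_feat.shape[0] or fused.shape[2] != grd_feat.shape[2]:
        raise ValueError("ground and fused aerial features must share H and C")

    w_s, w_g = fused.shape[1], grd_feat.shape[1]

    # Correlate the ground features with every circular shift of the aerial ones.
    scores = np.array([
        np.sum(circular_crop(fused, s, w_g) * grd_feat) for s in range(w_s)
    ])
    shift = int(np.argmax(scores))

    # Compare the aligned aerial crop with the ground features (cosine similarity).
    aligned = circular_crop(fused, shift, w_g)
    a = aligned / (np.linalg.norm(aligned) + 1e-12)
    g = grd_feat / (np.linalg.norm(grd_feat) + 1e-12)
    return shift, float(np.sum(a * g))


# Usage with stand-in features: the ground view is cut from the fused map
# at a known shift, so the estimate can be checked against it.
rng = np.random.default_rng(0)
sat = rng.standard_normal((4, 16, 8))
mask = rng.standard_normal((4, 16, 8))
grd = circular_crop(np.concatenate([sat, mask], axis=-1), 13, 6)
shift, sim = estimate_orientation_and_similarity(sat, mask, grd)
print(shift, shift / sat.shape[1] * 360.0)  # column shift and heading in degrees
```

In a retrieval setting, this similarity would be computed between the query and every candidate aerial image, and the candidates would be ranked by score. A narrower ground FoV gives the correlation fewer columns to work with, which is one reason limited-FoV matching is harder than panorama matching.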