Enhancing Ground-to-Aerial Image Matching for Visual Misinformation Detection Using Semantic Segmentation
Emanuele Mule, Matteo Pannacci, Ali Ghasemi Goudarzi, Francesco Pro, Lorenzo Papa, Luca Maiano, Irene Amerini
TL;DR
This work tackles the geolocation of non-geo-tagged ground-view images by matching them to satellite imagery without GPS data. It introduces SAN-QUAD, a four-stream Siamese-like network that fuses RGB images and semantic segmentation maps from both ground and satellite views, aided by a polar transformation and an azimuth-aligned correlation that estimates orientation. The approach uses a shared-feature fusion strategy and a symmetric triplet-like loss to produce discriminative cross-view representations, achieving up to a 9.8 percentage point improvement in top-1 recall on a CVUSA subset compared to prior methods. The method is relevant to misinformation detection, journalism, forensics, and Earth observation because it supports provenance verification of manipulated or non-geo-tagged images. Future work includes learnable channel-wise fusion and testing on additional datasets to assess how well the approach generalizes.
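The azimuth-aligned correlation mentioned above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the feature-map shapes, and the assumption that the width axis of the polar-transformed satellite features spans a full 360 degrees are all hypothetical choices made for illustration. The idea is to slide the ground-view features along the azimuth axis with wrap-around and take the best-scoring shift as the orientation estimate.

```python
import numpy as np


def azimuth_correlation(sat_feat, grd_feat):
    """Correlate ground features with satellite features at every azimuth shift.

    sat_feat: (H, W_s, C) polar-transformed satellite features; the width
              axis is ASSUMED to cover 360 degrees of azimuth.
    grd_feat: (H, W_g, C) ground-view features with W_g <= W_s
              (W_g < W_s models a limited field of view).
    Returns a (W_s,) array with one correlation score per shift.
    """
    _, w_s, _ = sat_feat.shape
    _, w_g, _ = grd_feat.shape
    scores = np.empty(w_s)
    for shift in range(w_s):
        # Wrap around the azimuth axis so the window can cross 360 -> 0 degrees.
        cols = (np.arange(w_g) + shift) % w_s
        scores[shift] = np.sum(sat_feat[:, cols, :] * grd_feat)
    return scores


def estimate_orientation(sat_feat, grd_feat):
    """Return the best shift (in columns) and the corresponding angle in degrees."""
    scores = azimuth_correlation(sat_feat, grd_feat)
    shift = int(np.argmax(scores))
    return shift, 360.0 * shift / sat_feat.shape[1]


# Toy example: a single "landmark" column at azimuth index 5 of 16.
sat = np.zeros((2, 16, 3))
sat[:, 5, :] = 1.0
grd = np.zeros((2, 4, 3))
grd[:, 0, :] = 1.0  # the landmark sits at the left edge of the ground view
print(estimate_orientation(sat, grd))
```

In this toy case the only non-zero score occurs at shift 5, so the estimate is 5 columns, or 112.5 degrees. In a real pipeline the features would come from the network branches rather than hand-built arrays, and the satellite features would typically be cropped at the estimated shift before computing the matching distance.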
Abstract
Recent advances in generative AI have significantly increased the online dissemination of altered images and videos, raising serious concerns about the credibility of digital media available on the Internet and distributed through information channels and social networks. This issue particularly affects domains that rely heavily on trustworthy data, such as journalism, forensic analysis, and Earth observation. To address these concerns, the ability to geolocate a non-geo-tagged ground-view image without external information, such as GPS coordinates, has become increasingly critical. This study tackles the challenge of linking a ground-view image, which may have a varying field of view (FoV), to its corresponding satellite image without the aid of GPS data. To this end, we propose a novel four-stream Siamese-like architecture, the Quadruple Semantic Align Net (SAN-QUAD), which extends previous state-of-the-art (SOTA) approaches by leveraging semantic segmentation applied to both ground and satellite imagery. Experimental results on a subset of the CVUSA dataset demonstrate significant improvements of up to 9.8% over prior methods across various FoV settings.
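Linking a ground-view image to its satellite counterpart is a retrieval task, and results of this kind are commonly reported as top-k recall. The sketch below shows one way such a metric could be computed. It is illustrative only: the function name, the use of cosine similarity, and the convention that row i of each embedding matrix belongs to the same location are assumptions, not details taken from the paper.

```python
import numpy as np


def top_k_recall(grd_emb, sat_emb, k=1):
    """Fraction of ground queries whose true satellite match ranks in the top k.

    grd_emb, sat_emb: (N, D) embedding matrices; row i of each is ASSUMED
    to describe the same location (the ground-truth pair).
    """
    # L2-normalize so the dot product equals cosine similarity.
    g = grd_emb / np.linalg.norm(grd_emb, axis=1, keepdims=True)
    s = sat_emb / np.linalg.norm(sat_emb, axis=1, keepdims=True)
    sim = g @ s.T                      # (N, N) query-to-candidate similarities
    true_sim = np.diag(sim)            # similarity to the correct candidate
    # Rank = number of candidates scoring strictly higher than the true match.
    rank = np.sum(sim > true_sim[:, None], axis=1)
    return float(np.mean(rank < k))


# Toy example: query 0 retrieves its match; queries 1 and 2 are swapped.
grd = np.eye(3)
sat = np.array([[1.0, 0.0, 0.0],
                [0.0, 0.0, 1.0],
                [0.0, 1.0, 0.0]])
print(top_k_recall(grd, sat, k=1))
print(top_k_recall(grd, sat, k=2))
```

Here only one of three queries ranks its true match first, so top-1 recall is one third, while every true match falls within the top two.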
