Enhancing Ground-to-Aerial Image Matching for Visual Misinformation Detection Using Semantic Segmentation

Emanuele Mule, Matteo Pannacci, Ali Ghasemi Goudarzi, Francesco Pro, Lorenzo Papa, Luca Maiano, Irene Amerini

TL;DR

This work tackles geolocating non-geo-tagged ground-view images by matching them to satellite imagery without GPS data. It introduces SAN-QUAD, a four-stream Siamese-like network that fuses RGB and semantic segmentation maps from both ground and satellite views, aided by a polar transformation and azimuth-aligned correlation to estimate orientation. The approach uses a shared-feature fusion strategy and a symmetric triplet-like loss to produce discriminative cross-view representations, achieving up to a 9.8 percentage point improvement in top-1 recall on a CVUSA subset compared to prior methods. The method has practical significance for misinformation detection, journalism, forensics, and Earth observation by enabling robust provenance verification of manipulated or non-tagged images. Future work includes learnable channel-wise fusion and testing on additional datasets to further generalize the approach.
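As a rough illustration of the azimuth-aligned correlation mentioned above, the sketch below slides a limited-FoV ground feature map along the width (azimuth) axis of a polar-transformed satellite feature map and keeps the best-scoring shift. It is a minimal NumPy sketch, not the authors' implementation: the function name `estimate_orientation`, the `(height, width, channels)` layout, the plain dot-product score, and the toy sizes are all assumptions made for illustration.

```python
import numpy as np


def estimate_orientation(sat_feat, grd_feat):
    """Circularly correlate ground features against satellite features.

    sat_feat: (H, W_s, C) array; assumed polar-transformed so that the
              width axis spans the full 360 degrees of azimuth.
    grd_feat: (H, W_g, C) array with W_g <= W_s (limited field of view).
    Returns (best_shift, scores), where scores[s] is the correlation
    obtained when the ground features start at satellite column s.
    """
    w_s = sat_feat.shape[1]
    w_g = grd_feat.shape[1]
    scores = np.empty(w_s)
    for shift in range(w_s):
        # Wrap around the azimuth axis: column W_s is column 0 again.
        cols = (np.arange(w_g) + shift) % w_s
        scores[shift] = np.sum(sat_feat[:, cols, :] * grd_feat)
    return int(np.argmax(scores)), scores


# Toy check with synthetic features (hypothetical sizes).
rng = np.random.default_rng(0)
sat = rng.standard_normal((4, 16, 8))
true_shift = 13  # chosen so the crop wraps around the 360-degree seam
cols = (np.arange(6) + true_shift) % 16
grd = sat[:, cols, :]

shift, scores = estimate_orientation(sat, grd)
angle_deg = shift * 360.0 / sat.shape[1]  # column shift -> azimuth angle
print(shift, angle_deg)
```

The explicit loop keeps the wrap-around visible; a practical version would more likely compute the same scores with an FFT or a circularly padded convolution.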

Abstract

Recent advances in generative AI have significantly increased the online dissemination of altered images and videos, raising serious concerns about the credibility of digital media available on the Internet and distributed through information channels and social networks. This issue particularly affects domains that rely heavily on trustworthy data, such as journalism, forensic analysis, and Earth observation. To address these concerns, the ability to geolocate a non-geo-tagged ground-view image without external information, such as GPS coordinates, has become increasingly critical. This study tackles the challenge of linking a ground-view image, potentially exhibiting varying fields of view (FoV), to its corresponding satellite image without the aid of GPS data. To achieve this, we propose a novel four-stream Siamese-like architecture, the Quadruple Semantic Align Net (SAN-QUAD), which extends previous state-of-the-art (SOTA) approaches by leveraging semantic segmentation applied to both ground and satellite imagery. Experimental results on a subset of the CVUSA dataset demonstrate significant improvements of up to 9.8% over prior methods across various FoV settings.

Paper Structure

This paper contains 21 sections, 2 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Example of Ground-to-Aerial Matching: This illustration demonstrates the task of identifying the corresponding satellite image for a given ground-view query. The top rectangle displays the ground-view query image while the bottom section shows the top-5 (from left to right) satellite image matches. The correct match is highlighted in green, while incorrect matches are marked in red.
  • Figure 2: Example of a satellite semantic segmentation mask produced by the NEOS model.
  • Figure 3: Example of a ground semantic segmentation mask produced by the Mask2Former model.
  • Figure 4: General overview of our methodology: SAN-QUAD. The architecture is composed of four branches, two for the ground viewpoint and two for the satellite one. Each branch takes as input either an RGB image or a semantic segmentation mask and produces a feature volume. The volumes belonging to the same viewpoint are then combined into the final feature representations, which are compared to obtain the most likely orientation and perform the matching.
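
The four-branch pipeline of Figure 4 and the top-5 retrieval of Figure 1 can be sketched roughly as follows. This is a minimal NumPy sketch under stated assumptions, not the authors' implementation: fusing by channel-wise concatenation, the helper names (`fuse`, `describe`, `top_k_matches`, `recall_at_k`), cosine similarity as the comparison, and the convention that ground image i matches satellite image i are all illustrative choices. The orientation-alignment step is omitted for brevity, and random arrays stand in for the outputs of the four branches.

```python
import numpy as np


def fuse(rgb_feat, seg_feat):
    """Combine RGB and segmentation feature volumes of one viewpoint.

    Both inputs are (N, H, W, C) arrays. Concatenating along the channel
    axis is an assumption; the source only says the volumes are combined.
    """
    return np.concatenate([rgb_feat, seg_feat], axis=-1)


def describe(feat, eps=1e-12):
    """Flatten each feature volume and scale it to unit L2 norm."""
    flat = feat.reshape(feat.shape[0], -1)
    return flat / (np.linalg.norm(flat, axis=1, keepdims=True) + eps)


def top_k_matches(grd_desc, sat_desc, k=5):
    """Return, for each ground query, the indices of the k most similar
    satellite descriptors, best match first (cosine similarity)."""
    sims = grd_desc @ sat_desc.T
    return np.argsort(-sims, axis=1)[:, :k]


def recall_at_k(ranked, k):
    """Fraction of queries whose true match (assumed to share the query's
    index) appears among the first k retrieved candidates."""
    truth = np.arange(ranked.shape[0])[:, None]
    return float(np.mean(np.any(ranked[:, :k] == truth, axis=1)))


# Toy data: 10 locations, four feature volumes each (hypothetical sizes).
rng = np.random.default_rng(0)
sat_rgb = rng.standard_normal((10, 4, 16, 8))
sat_seg = rng.standard_normal((10, 4, 16, 8))
# Ground features are noisy copies, so the correct match is recoverable.
grd_rgb = sat_rgb + 0.05 * rng.standard_normal(sat_rgb.shape)
grd_seg = sat_seg + 0.05 * rng.standard_normal(sat_seg.shape)

sat_desc = describe(fuse(sat_rgb, sat_seg))
grd_desc = describe(fuse(grd_rgb, grd_seg))
ranked = top_k_matches(grd_desc, sat_desc, k=5)
print(recall_at_k(ranked, 1), recall_at_k(ranked, 5))
```

Recall at k, as computed here, counts a query as correct when its true satellite image appears anywhere in the first k candidates, which is the sense in which the green-highlighted match in Figure 1 counts as a top-5 hit.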