Deepfake Geography: Detecting AI-Generated Satellite Images

Mansur Yerzhanuly

TL;DR

This paper tackles the authenticity of satellite imagery in the era of AI-generated content. It systematically compares CNNs and Vision Transformers on a large RGB dataset derived from DM-AER and FSI, finding ViTs superior in accuracy and robustness. Explainability analyses using Grad-CAM for CNNs and Chefer's transformer attribution reveal complementary detection cues and bolster trust in the models. The work has practical implications for journalism, environmental science, and defense, and points to future work in multispectral/SAR data and frequency-domain artifact detection.

Abstract

The rapid advancement of generative models such as StyleGAN2 and Stable Diffusion poses a growing threat to the authenticity of satellite imagery, which is increasingly vital for reliable analysis and decision-making across scientific and security domains. While deepfake detection has been extensively studied in facial contexts, satellite imagery presents distinct challenges, including terrain-level inconsistencies and structural artifacts. In this study, we conduct a comprehensive comparison between Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) for detecting AI-generated satellite images. Using a curated dataset of over 130,000 labeled RGB images from the DM-AER and FSI datasets, we show that ViTs significantly outperform CNNs in both accuracy (95.11 percent vs. 87.02 percent) and overall robustness, owing to their ability to model long-range dependencies and global semantic structures. We further enhance model transparency using architecture-specific interpretability methods, including Grad-CAM for CNNs and Chefer's attention attribution for ViTs, revealing distinct detection behaviors and validating model trustworthiness. Our results highlight the ViT's superior performance in detecting structural inconsistencies and repetitive textural patterns characteristic of synthetic imagery. Future work will extend this research to multispectral and SAR modalities and integrate frequency-domain analysis to further strengthen detection capabilities and safeguard satellite imagery integrity in high-stakes applications.
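The accuracy figures quoted above, together with the precision, recall, and F1-score reported in the paper's figures, are standard binary classification metrics. As a minimal sketch of how they relate, assuming the label convention 1 = AI-generated and 0 = real (the function name `binary_metrics` and the toy labels are illustrative, not taken from the paper):

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for binary labels.

    Assumption: `positive` marks the AI-generated (fake) class.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    # Guard every ratio against an empty denominator.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy labels for illustration only (not results from the paper).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Accuracy alone can hide an imbalance between missed fakes and false alarms, which is why precision and recall are usually read alongside it.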

Paper Structure

This paper contains 6 sections, 2 equations, 10 figures.

Figures (10)

  • Figure 1: Comparison between a real rural satellite image (A, source: IndiaAI, 2023) and a synthetic urban image (B, source: GeekWire, 2021), illustrating natural terrain continuity (A) versus repeated textures and abrupt transitions typical of AI-generated imagery (B).
  • Figure 2: F1-score, precision, and recall comparison over epochs between CNN and ViT.
  • Figure 3: Training and validation loss curves for the ViT and CNN models over 20 epochs.
  • Figure 4: Accuracy over epochs for CNN and ViT models on the training and validation sets.
  • Figure 5: Grad-CAM visualization of a fake satellite image classified by the CNN model.
  • ...and 5 more figures
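
For readers unfamiliar with the Grad-CAM visualization named in Figure 5, the core computation can be sketched in a few lines. This is a generic illustration of the technique, not the authors' implementation: it assumes the feature maps of one convolutional layer and the gradients of the "fake" class score with respect to them have already been extracted, and the name `grad_cam` and the toy 2x2 values are hypothetical.

```python
def grad_cam(activations, gradients):
    """Grad-CAM heatmap from K feature maps and their gradients.

    activations, gradients: lists of K maps, each an H x W list of lists.
    Assumption: gradients are d(class score)/d(activation) for one image.
    """
    height = len(activations[0])
    width = len(activations[0][0])

    # Channel weights: global average of each gradient map.
    alphas = [sum(sum(row) for row in g) / (height * width)
              for g in gradients]

    # Weighted sum of feature maps, followed by ReLU.
    cam = [[max(0.0, sum(a * act[i][j]
                         for a, act in zip(alphas, activations)))
            for j in range(width)]
           for i in range(height)]

    # Normalize to [0, 1] so the map can be overlaid on the image.
    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam


# Toy example with two 2x2 feature maps (illustrative values only).
acts = [[[1, 0], [0, 2]], [[0, 1], [1, 0]]]
grads = [[[0.5, 0.5], [0.5, 0.5]], [[-1, -1], [-1, -1]]]
heatmap = grad_cam(acts, grads)
print(heatmap)
```

In practice the heatmap is upsampled to the input resolution and overlaid on the satellite image; the ReLU keeps only regions that push the prediction toward the chosen class.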