Semantic-Aware Reconstruction Error for Detecting AI-Generated Images

Ju Yeon Kang, Jaehong Park, Semin Kim, Ji Won Yoon, Nam Soo Kim

Abstract

Recently, AI-generated image detection has gained increasing attention, as the rapid advancement of image generation technologies has raised serious concerns about their potential misuse. While existing detection methods have achieved promising results, their performance often degrades significantly when facing fake images from unseen, out-of-distribution (OOD) generative models, since they primarily rely on model-specific artifacts and thus overfit to the models used for training. To address this limitation, we propose a novel representation, namely Semantic-Aware Reconstruction Error (SARE), that measures the semantic difference between an image and its caption-guided reconstruction. The key hypothesis behind SARE is that real images, whose captions often fail to fully capture their complex visual content, may undergo noticeable semantic shifts during the caption-guided reconstruction process. In contrast, fake images, which closely align with their captions, show minimal semantic changes. By quantifying these semantic shifts, SARE provides a robust and discriminative feature for detecting fake images across diverse generative models. Additionally, we introduce a fusion module that integrates SARE into the backbone detector via a cross-attention mechanism. Image features attend to semantic representations extracted from SARE, enabling the model to adaptively leverage semantic information. Experimental results demonstrate that the proposed method achieves strong generalization, outperforming existing baselines on benchmarks including GenImage and ForenSynths. We further validate the effectiveness of caption guidance through a detailed analysis of semantic shifts, confirming its ability to enhance detection robustness.
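
As a rough illustration of the reconstruction-error idea, the sketch below computes a per-pixel error map between an image and a reconstruction, then scales it by 2 for display as the figure captions describe. It is a minimal sketch under stated assumptions, not the authors' implementation: the caption-guided reconstruction (a captioner plus a diffusion model) is replaced by toy perturbed arrays, the use of the absolute value is an assumption (the source only says "difference"), and the names `reconstruction_error` and `scale_for_display` are hypothetical.

```python
import numpy as np


def reconstruction_error(image, reconstruction):
    """Per-pixel absolute difference between an image and its reconstruction.

    Both inputs are float arrays of the same shape with values in [0, 1].
    Taking the absolute value is an assumption made for this sketch.
    """
    image = np.asarray(image, dtype=np.float64)
    reconstruction = np.asarray(reconstruction, dtype=np.float64)
    if image.shape != reconstruction.shape:
        raise ValueError("image and reconstruction must have the same shape")
    return np.abs(image - reconstruction)


def scale_for_display(error_map, factor=2.0):
    """Brighten an error map for viewing and clip it back into [0, 1]."""
    return np.clip(error_map * factor, 0.0, 1.0)


# Toy stand-ins for reconstructions (hypothetical data, not model outputs):
# a small perturbation mimics a reconstruction that stays close to the input,
# a large perturbation mimics one that drifts away from it.
rng = np.random.default_rng(0)
image = rng.random((4, 4, 3))
small_shift = np.clip(image + 0.01, 0.0, 1.0)
large_shift = np.clip(image + 0.30, 0.0, 1.0)

err_small = reconstruction_error(image, small_shift)
err_large = reconstruction_error(image, large_shift)
display_map = scale_for_display(err_large)
```

Under the hypothesis stated in the abstract, a real image would be expected to behave more like the large-shift case and a fake image more like the small-shift case, although pixel-space error is only a simple proxy for the semantic shift the paper targets.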

Paper Structure

This paper contains 33 sections, 7 equations, 9 figures, 14 tables.

Figures (9)

  • Figure 1: Comparison of caption-guided reconstructions for real and fake images. Real images, whose captions often fail to fully capture their complex visual content, undergo noticeable semantic shifts during caption-guided reconstruction. In contrast, fake images, which align closely with their captions, tend to exhibit minimal semantic changes.
  • Figure 2: Examples from the GenImage dataset and their corresponding DIREs and SAREs. Images are reconstructed using Stable Diffusion v1. DIRE may produce larger errors for OOD fake images than for real images, contradicting its underlying assumption. In contrast, SARE consistently yields higher values for real images than for fake images. For clearer visualization, the pixel values of the error maps are scaled by 2.
  • Figure 3: Overview of the SARE framework. Our method reconstructs the input image conditioned on its caption using the Stable Diffusion model with classifier-free guidance. SARE is computed as the difference between the input and reconstructed images, and is incorporated into the detection process through a cross-attention module that uses image features as queries and SARE features as keys and values. The pixel values of the SARE are scaled by 2 for clearer visualization.
  • Figure 4: Semantic shift analysis based on LPIPS scores. Higher scores indicate lower similarity between the original and reconstructed images. Images are reconstructed under two conditions: with and without caption guidance.
  • Figure 5: Real and fake images from the GenImage dataset, with captions generated by a pre-trained BLIP model, the corresponding reconstructed images, and SAREs. The pixel values of the SARE are scaled by 2 for clearer visualization.
  • ...and 4 more figures
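
The fusion step described in the Figure 3 caption, where image features act as queries and SARE features act as keys and values, can be sketched as single-head scaled dot-product cross-attention. This is a minimal sketch under stated assumptions rather than the paper's architecture: the token counts, the feature width `d_model`, the random projection matrices, and the residual addition at the end are all hypothetical choices made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(image_tokens, sare_tokens, w_q, w_k, w_v):
    """Single-head cross-attention.

    Queries come from the image features; keys and values come from the
    SARE features, matching the roles named in the Figure 3 caption.
    """
    q = image_tokens @ w_q                      # (n_img, d)
    k = sare_tokens @ w_k                       # (n_sare, d)
    v = sare_tokens @ w_v                       # (n_sare, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])     # (n_img, n_sare)
    weights = softmax(scores, axis=-1)          # each row sums to 1
    return weights @ v, weights


# Hypothetical sizes and random features, used only to exercise the function.
rng = np.random.default_rng(0)
d_model = 8
image_tokens = rng.normal(size=(5, d_model))    # 5 image feature tokens
sare_tokens = rng.normal(size=(3, d_model))     # 3 SARE feature tokens
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))

attended, weights = cross_attention(image_tokens, sare_tokens, w_q, w_k, w_v)

# Adding the attended features back onto the image features is an assumption;
# the source does not say how the attention output is combined.
fused = image_tokens + attended
```

Each row of `weights` shows how strongly one image token draws on each SARE token, which is one way to read the abstract's statement that the model adaptively uses semantic information.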