SIDA: Social Media Image Deepfake Detection, Localization and Explanation with Large Multimodal Model

Zhenglin Huang, Jinwei Hu, Xiangtai Li, Yiwei He, Xingyu Zhao, Bei Peng, Baoyuan Wu, Xiaowei Huang, Guangliang Cheng

TL;DR

This work tackles the rising threat of social-media image deepfakes by introducing SID-Set, a large, diverse benchmark of 300K real, synthetic, and tampered images with rich annotations and explanations. It then presents SIDA, a large multimodal model framework that jointly detects authenticity, localizes manipulated regions, and generates textual explanations by extending a vision-language backbone with DET and SEG tokens. Across detection, localization, robustness, and cross-benchmark tests, SIDA achieves strong performance and demonstrates interpretability through mask predictions and description generation, signaling practical utility for real-world misinformation defense. Limitations include dataset size and reliance on a single generation method, with future work aimed at expanding generation diversity and further improving localization accuracy.

Abstract

The rapid advancement of generative models in creating highly realistic images poses substantial risks for misinformation dissemination. For instance, a synthetic image, when shared on social media, can mislead extensive audiences and erode trust in digital content, resulting in severe repercussions. Despite some progress, academia has not yet created a large and diverse deepfake detection dataset for social media, nor has it devised an effective solution to address this issue. In this paper, we introduce the Social media Image Detection dataSet (SID-Set), which offers three key advantages: (1) extensive volume, featuring 300K AI-generated/tampered and authentic images with comprehensive annotations, (2) broad diversity, encompassing fully synthetic and tampered images across various classes, and (3) elevated realism, with images that are predominantly indistinguishable from genuine ones through mere visual inspection. Furthermore, leveraging the exceptional capabilities of large multimodal models, we propose a new image deepfake detection, localization, and explanation framework, named SIDA (Social media Image Detection, localization, and explanation Assistant). SIDA not only discerns the authenticity of images, but also delineates tampered regions through mask prediction and provides textual explanations of the model's judgment criteria. Extensive experiments on SID-Set and other benchmarks demonstrate that SIDA achieves superior performance over state-of-the-art deepfake detection models across diverse settings. The code, model, and dataset will be released.

Paper Structure

This paper contains 25 sections, 6 equations, 19 figures, 8 tables.

Figures (19)

  • Figure 1: The framework comparisons. Existing deepfake methods (a-b) are limited to detection, localization, or both. In contrast, SIDA (c) offers a more comprehensive solution, capable of handling detection, localization, and explanation tasks.
  • Figure 2: SID-Set examples. The first row shows synthetic images, while the second row shows tampered images. (Zoom in to view)
  • Figure 3: Tampered image generation pipeline. It consists of four stages: extracting objects from captions using GPT-4o, obtaining object masks with Language-SAM, setting up replacement dictionaries for generating tampered images, and generating new images using Latent Diffusion. This figure illustrates an example of object replacement (e.g., "cat" to "dog") and attribute modification.
  • Figure 4: Examples of tampered images. (Zoom in to view)
  • Figure 5: The pipeline of SIDA: Given an image $x_{i}$ and the corresponding text input $x_{t}$, the last hidden layer for the <DET> token provides the detection result. If the detection result indicates a tampered image, SIDA extracts the <SEG> token to generate masks for the tampered regions. This figure shows an example where the man's face has been manipulated.
  • ...and 14 more figures
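
The detect-then-localize control flow described in the Figure 5 caption can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the names `sida_infer`, `SidaOutput`, `det_head`, and `mask_decoder` are hypothetical, the three-way label set and its order are assumed from the real/synthetic/tampered split of SID-Set, and the hidden states are plain lists standing in for the last-layer embeddings of the <DET> and <SEG> tokens.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

# Assumed label set and order (hypothetical; based on SID-Set's three image types).
LABELS = ("real", "synthetic", "tampered")

Mask = List[List[int]]  # toy binary mask: rows of 0/1 values


@dataclass
class SidaOutput:
    label: str
    mask: Optional[Mask]  # populated only when the image is judged tampered


def argmax(scores: Sequence[float]) -> int:
    """Index of the largest score."""
    return max(range(len(scores)), key=lambda i: scores[i])


def sida_infer(
    det_hidden: Sequence[float],
    seg_hidden: Sequence[float],
    det_head: Callable[[Sequence[float]], Sequence[float]],
    mask_decoder: Callable[[Sequence[float]], Mask],
) -> SidaOutput:
    """Classify from the <DET> hidden state; decode a mask from the
    <SEG> hidden state only if the detection result is 'tampered'."""
    label = LABELS[argmax(det_head(det_hidden))]
    if label != "tampered":
        return SidaOutput(label=label, mask=None)
    return SidaOutput(label=label, mask=mask_decoder(seg_hidden))


# Toy stand-ins for the learned heads (purely illustrative).
def toy_det_head(hidden: Sequence[float]) -> Sequence[float]:
    return list(hidden[:3])  # treat the first three values as class scores


def toy_mask_decoder(hidden: Sequence[float]) -> Mask:
    return [[1 if v > 0 else 0 for v in hidden]]  # threshold into one mask row


tampered = sida_infer([0.1, 0.2, 0.9], [0.5, -0.5], toy_det_head, toy_mask_decoder)
authentic = sida_infer([0.9, 0.05, 0.05], [0.5, -0.5], toy_det_head, toy_mask_decoder)
```

In this sketch the mask decoder is skipped whenever the detection branch does not predict a tampered image, which mirrors the conditional step in the caption; how the real model computes either head is not specified here.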