Can GPT tell us why these images are synthesized? Empowering Multimodal Large Language Models for Forensics

Yiran He, Yun Cao, Bowen Yang, Zeyu Zhang

TL;DR

This work investigates the use of multimodal LLMs for forensics of AI-generated content in images, proposing a two-stage framework that first detects forgery and then analyzes tampering with localization, content description, reasoning, and generation-method tracing. Through careful prompt engineering and few-shot learning, the approach leverages LLM semantic understanding to achieve competitive detection and high-quality, interpretable forensic reports, demonstrated across diverse datasets with metrics such as AUC and localization scores. The study shows GPT-4V yielding the strongest performance among tested models, while also highlighting limitations like refusals and semantic confusion on real-face images, and it provides ablations to quantify the impact of prompts and exemplars. The work suggests practical pathways to integrate LLMs with downstream tools and extend the paradigm to video and audio forgery, aiming to enhance robustness and scalability of forensic analysis in real-world settings.

Abstract

The rapid development of generative AI facilitates content creation and makes image manipulation both easier to perform and harder to detect. While multimodal Large Language Models (LLMs) have encoded rich world knowledge, they are not inherently tailored for combating AI-generated content (AIGC) and struggle to comprehend local forgery details. In this work, we investigate the application of multimodal LLMs in forgery detection. We propose a framework capable of evaluating image authenticity, localizing tampered regions, providing evidence, and tracing generation methods based on semantic tampering clues. Our method demonstrates that the potential of LLMs in forgery analysis can be effectively unlocked through meticulous prompt engineering and the application of few-shot learning techniques. We conduct qualitative and quantitative experiments and show that GPT-4V can achieve an accuracy of 92.1% on Autosplice and 86.3% on LaMa, which is competitive with state-of-the-art AIGC detection methods. We further discuss the limitations of multimodal LLMs in such tasks and propose potential improvements.

Paper Structure

This paper contains 16 sections, 1 equation, 11 figures, 5 tables.

Figures (11)

  • Figure 1: The overall process of leveraging multimodal LLMs to analyze synthesized images. First, we treat it as a fake image classification task. Then, we elicit the LLMs' forensic analysis ability through prompt engineering and in-context learning (ICL). LLMs generate the final report from four perspectives: Location, Contents, Visible Details, and Generation Method.
  • Figure 2: Overview of our proposed multimodal LLM forensic analysis framework. By leveraging a two-stage workflow, we can apply the same LLM across different types of tasks: (1) localizing the manipulated regions, (2) describing the forged objects, (3) providing reasons for the forgery judgment, and (4) tracing the forgery method.
  • Figure 3: A list of prompts for GPT-4V in detecting 1,000 faces from the Autosplice dataset. At the top, we show that the design of all five prompts is based on five basic principles: Profile, Goal, Constraint, Workflow, and Style. From top to bottom, the prompts grow longer, each incorporating more of these principles. We use Prompt #4 in practice.
  • Figure 4: Prompts for GPT-4o when analyzing DeepFakes. In Stage 1, we use a simple prompt to have the LLM answer a two-class question; in Stage 2, once it recognizes an image as a DeepFake, it must analyze the fake image from four perspectives: localization, description, reasoning, and tracing. In this process, we provide GPT with as many perspectives for consideration as possible. We also use two examples in the user prompt to elicit the in-context learning (ICL) ability of the LLM, which are not shown here.
  • Figure 5: Examples of GPT-4o for DeepFake classification in Stage 1, containing both objects and human faces. Left: Results for real images from the Caltech-101 [caltech101] dataset and the Caltech-WebFaces [fink_perona_2022] dataset. Right: Results for AI-generated images from the Stable Diffusion [stablediffusion] and StyleGAN [stylegan] datasets. The responses for real faces are labeled in green, while those for AI-generated faces are labeled in pink. Both success (with a happy icon) and failure (with an unhappy icon) are shown.
  • ...and 6 more figures
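
The two-stage workflow described in the captions of Figures 2 and 4 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt strings, the `Model` callable interface, the `ForensicResult` type, and the `stub_model` stand-in are all assumptions introduced here, and the paper's actual prompts and few-shot examples are not reproduced. Only the four report perspectives (Location, Contents, Visible Details, Generation Method) and the detect-then-analyze ordering come from the source.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical prompts; the paper's real wording (Figure 4) is not reproduced here.
STAGE1_PROMPT = "Is this image real or AI-generated? Answer 'real' or 'fake'."
STAGE2_PROMPT = (
    "The image was judged fake. Report one 'Key: value' line for each of: "
    "Location, Contents, Visible Details, Generation Method."
)
# The four report perspectives named in Figure 1.
REPORT_KEYS = ("Location", "Contents", "Visible Details", "Generation Method")

# Assumed interface for any multimodal LLM: (prompt, image_path) -> text reply.
Model = Callable[[str, str], str]


@dataclass
class ForensicResult:
    is_fake: Optional[bool]  # None when the reply is a refusal or unclear
    report: dict


def parse_verdict(reply: str) -> Optional[bool]:
    """Map a Stage 1 reply to True (fake), False (real), or None (unclear)."""
    text = reply.strip().lower()
    if text.startswith("fake"):
        return True
    if text.startswith("real"):
        return False
    return None


def parse_report(reply: str) -> dict:
    """Collect 'Key: value' lines whose key is one of the four perspectives."""
    report = {}
    for line in reply.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in REPORT_KEYS:
            report[key.strip()] = value.strip()
    return report


def analyze(image_path: str, model: Model) -> ForensicResult:
    """Stage 1: classify. Stage 2: request a report only for images judged fake."""
    verdict = parse_verdict(model(STAGE1_PROMPT, image_path))
    if verdict is not True:
        return ForensicResult(verdict, {})
    return ForensicResult(True, parse_report(model(STAGE2_PROMPT, image_path)))


def stub_model(prompt: str, image_path: str) -> str:
    """Stand-in for a real LLM call so the sketch runs offline."""
    if prompt == STAGE1_PROMPT:
        return "fake" if "fake" in image_path else "real"
    return (
        "Location: lower left\nContents: a dog\n"
        "Visible Details: blurred edges\nGeneration Method: inpainting"
    )


if __name__ == "__main__":
    print(analyze("fake_01.png", stub_model))
    print(analyze("real_01.png", stub_model))
```

Returning `None` for an unparseable Stage 1 reply is one way to keep refusals, which the summary lists as a limitation, separate from genuine "real" verdicts when computing detection metrics.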