Forensics-Bench: A Comprehensive Forgery Detection Benchmark Suite for Large Vision Language Models

Jin Wang, Chenghui Lv, Xian Li, Shichao Dong, Huadong Li, Kelu Yao, Chao Li, Wenqi Shao, Ping Luo

TL;DR

Forensics-Bench introduces a comprehensive forgery-detection benchmark for Large Vision Language Models (LVLMs), designed to test recognition, localization, and reasoning across diverse media manipulated by AI. It assembles 63,292 multimodal questions spanning 112 forgery detection types across five perspectives, and evaluates 25 LVLMs (22 open-source and 3 proprietary) to reveal substantial challenges and biases in current models. The study also provides two additional evaluation protocols (robustness under perturbations and forgery-model attribution) and extensive analyses across semantics, modalities, tasks, and forgery sources. By offering a standardized, large-scale testing ground, Forensics-Bench aims to propel the development of all-around LVLM forgery detectors and to guide future research on aligning LVLMs with real-world forgery mitigation.

Abstract

Recently, the rapid development of AIGC has significantly increased the diversity of fake media spread on the Internet, posing unprecedented threats to social security, politics, law, etc. To detect the increasingly diverse malicious fake media in the new era of AIGC, recent studies have proposed exploiting Large Vision Language Models (LVLMs) to design robust forgery detectors, owing to their impressive performance on a wide range of multimodal tasks. However, there is still no comprehensive benchmark designed to assess LVLMs' discerning capabilities on forged media. To fill this gap, we present Forensics-Bench, a new forgery detection evaluation benchmark suite that assesses LVLMs across a massive range of forgery detection tasks, requiring comprehensive recognition, localization, and reasoning capabilities on diverse forgeries. Forensics-Bench comprises 63,292 meticulously curated multi-choice visual questions, covering 112 unique forgery detection types from 5 perspectives: forgery semantics, forgery modalities, forgery tasks, forgery types, and forgery models. We conduct thorough evaluations on 22 open-source LVLMs and 3 proprietary models (GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet), highlighting the significant challenges of comprehensive forgery detection posed by Forensics-Bench. We anticipate that Forensics-Bench will motivate the community to advance the frontier of LVLMs, striving for all-around forgery detectors in the era of AIGC. The deliverables will be updated at https://Forensics-Bench.github.io/.

Paper Structure

This paper contains 23 sections, 19 figures, 26 tables.

Figures (19)

  • Figure 1: Overview of Forensics-Bench. Forensics-Bench consists of $63K$ samples, covering $112$ unique forgery detection types from five different perspectives characterizing forgeries. As shown in the rings from the inside out, these five perspectives are forgery semantics, forgery modalities, forgery tasks, forgery types, and forgery models. From the view of forgery semantics, Forensics-Bench includes data featuring: HS $\rightarrow$ Human Subject and GS $\rightarrow$ General Subject. From the view of forgery modalities, Forensics-Bench includes data featuring: RGB$\&$TXT$\rightarrow$RGB Image $\&$ Text, VID$\rightarrow$Video, etc. From the view of forgery tasks, Forensics-Bench includes data featuring: TL$\rightarrow$Temporal Localization, SLS$\rightarrow$Spatial Localization (Segmentation), BC$\rightarrow$Binary Classification, etc. From the view of forgery types, Forensics-Bench includes data featuring: TS$\rightarrow$Text Swap, FSS$\rightarrow$Face Swap (Single Face), ES$\rightarrow$Entire Synthesis, etc. From the view of forgery models, Forensics-Bench includes data generated from: GAN$\rightarrow$Generative Adversarial Models, DF$\rightarrow$Diffusion models, VAE$\rightarrow$Variational Auto-Encoders, etc. Forensics-Bench enables comprehensive evaluations of LVLMs on versatile forgery detection types in the evolving era of AIGC. Please see the abbreviation appendix of the paper for the full list of abbreviations.
  • Figure 2: Forensics-Bench evaluation results of Large Vision Language Models (LVLMs). We visualize the evaluation results of representative LVLMs across the five Forensics-Bench perspectives on the left side and present the overall leaderboard results on the right side. For detailed quantitative results, please refer to the overall results table in the paper.
  • Figure 3: An illustration of the data collection pipeline of Forensics-Bench. First, guided by the $5$ designed perspectives of Forensics-Bench, we searched the Internet for related publicly available datasets. Then, we collated the retrieved datasets into a uniform metadata format. Finally, we either manually transformed the original data into handcrafted Questions & Answers (Q&A) or performed the Q&A transformation with the aid of ChatGPT. Forensics-Bench supports evaluations over diverse kinds of forgeries across various perspectives. Please zoom in for better visualization.
  • Figure 4: Results of Forensics-Bench from the perspective of forgery semantics. Most LVLMs did not demonstrate a strong bias towards certain media content in terms of human subject vs. general subject.
  • Figure 5: Results of Forensics-Bench from the perspective of forgery modality. Current LVLMs failed to perform well across all forgery modalities collected in Forensics-Bench.
  • ...and 14 more figures
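
The Figure 3 caption describes collating source datasets into a uniform metadata format and turning them into multi-choice Q&A, and Figures 4 and 5 report results broken down by perspective. A minimal Python sketch of how such a record and a per-perspective accuracy breakdown might look is given below. The `ForgeryQuestion` class, its field names, and the `accuracy_by` helper are all hypothetical illustrations and are not taken from the Forensics-Bench release; the toy items are invented for demonstration and are not benchmark data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ForgeryQuestion:
    """One multi-choice item. All field names are hypothetical."""
    question: str
    options: tuple[str, ...]  # e.g. ("A. Real", "B. Fake")
    answer: str               # ground-truth option letter, e.g. "B"
    semantics: str            # e.g. "HS" (Human Subject) or "GS" (General Subject)
    modality: str             # e.g. "RGB", "VID"
    task: str                 # e.g. "BC" (Binary Classification)
    forgery_type: str         # e.g. "FSS" (Face Swap, Single Face)
    forgery_model: str        # e.g. "GAN", "DF"


def accuracy_by(items, predictions, perspective):
    """Group items by one perspective field and return accuracy per group."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, pred in zip(items, predictions):
        key = getattr(item, perspective)
        total[key] += 1
        # Compare option letters, ignoring case and surrounding whitespace.
        correct[key] += int(pred.strip().upper() == item.answer)
    return {key: correct[key] / total[key] for key in total}


# Toy items, invented purely for illustration.
OPTIONS = ("A. Real", "B. Fake")
items = [
    ForgeryQuestion("Is this image real or fake?", OPTIONS, "B",
                    "HS", "RGB", "BC", "FSS", "GAN"),
    ForgeryQuestion("Is this image real or fake?", OPTIONS, "B",
                    "GS", "RGB", "BC", "ES", "DF"),
    ForgeryQuestion("Is this video real or fake?", OPTIONS, "B",
                    "HS", "VID", "BC", "FSS", "GAN"),
]
predictions = ["B", "A", " b"]  # hypothetical model outputs

print(accuracy_by(items, predictions, "semantics"))
print(accuracy_by(items, predictions, "modality"))
```

Keeping every perspective as a plain field on the record means the same helper can produce the breakdown for any of the five perspectives by changing only the `perspective` argument.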