MFC-Bench: Benchmarking Multimodal Fact-Checking with Large Vision-Language Models

Shengkang Wang, Hongzhan Lin, Ziyang Luo, Zhen Ye, Guang Chen, Jing Ma

TL;DR

MFC-Bench is a large-scale benchmark (35K multimodal samples) for evaluating large vision-language models (LVLMs) on three verdict-prediction tasks in multimodal fact-checking: Manipulation Classification, Out-of-Context Classification, and Veracity Classification. The study assesses 18 LVLMs in zero-shot settings and under prompting strategies including chain-of-thought (CoT) and in-context learning (ICL), and analyzes justification production, model interpretability, and biases. Key findings show that current LVLMs struggle with Manipulation Classification and Veracity Classification, while Out-of-Context Classification is comparatively easier; human performance still surpasses most models on certain tasks. The work emphasizes the need for improved factual grounding in LVLMs and suggests directions for future research in trustworthy AI, interpretability, and broader multimodal capabilities.

Abstract

Large vision-language models (LVLMs) have significantly improved performance on multimodal reasoning tasks such as visual question answering and image captioning. These models embed multimodal facts within their parameters rather than relying on external knowledge bases to store factual information explicitly. However, the content discerned by LVLMs may deviate from factuality due to inherent bias or incorrect inference. To address this issue, we introduce MFC-Bench, a rigorous and comprehensive benchmark designed to evaluate the factual accuracy of LVLMs across three stages of verdict prediction for multimodal fact-checking (MFC): Manipulation, Out-of-Context, and Veracity Classification. In our evaluation on MFC-Bench, we benchmarked a dozen diverse and representative LVLMs and found that current models still fall short in multimodal fact-checking and are insensitive to various forms of manipulated content. We hope that MFC-Bench will draw attention to trustworthy AI potentially assisted by LVLMs in the future. MFC-Bench and accompanying resources are publicly accessible at https://github.com/wskbest/MFC-Bench, contributing to ongoing research in the multimodal fact-checking field.
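
As a rough illustration of what scoring verdict predictions across the three stages could look like, the sketch below computes per-task accuracy from (task, gold label, predicted label) triples. Everything in it is an assumption made for illustration: the task keys, the label strings, the record layout, and the choice of accuracy as the metric are hypothetical and are not taken from the MFC-Bench repository or the paper's evaluation protocol.

```python
from collections import defaultdict

# Hypothetical task keys, named after the three stages listed in the abstract.
TASKS = ("manipulation", "out_of_context", "veracity")


def per_task_accuracy(records):
    """Return {task: accuracy} for (task, gold_label, predicted_label) triples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for task, gold, pred in records:
        if task not in TASKS:
            raise ValueError(f"unknown task: {task!r}")
        total[task] += 1
        correct[task] += int(gold == pred)
    # Only tasks that actually appear in the records are reported.
    return {task: correct[task] / total[task] for task in total}


# Toy records for illustration only; these are not MFC-Bench data,
# and the label strings are invented.
toy_records = [
    ("manipulation", "manipulated", "original"),
    ("manipulation", "manipulated", "manipulated"),
    ("out_of_context", "true", "true"),
    ("veracity", "supported", "refuted"),
]
scores = per_task_accuracy(toy_records)
```

Reporting each task separately, rather than one pooled number, keeps a comparatively easy task from masking weak results on a harder one.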

Paper Structure (48 sections, 9 figures, 11 tables)

This paper contains 48 sections, 9 figures, 11 tables.

Figures (9)

  • Figure 1: MFC-Bench is a comprehensive benchmark designed to evaluate the LVLMs across three stages of verdict prediction for MFC: Manipulation Classification, Out-of-Context Classification, and Veracity Classification.
  • Figure 2: Comparison of prompts in zero-shot and few-shot scenarios with and without CoT.
  • Figure 3: Comparison between few-shot conditions w/ and w/o CoT for GPT-4o, LLaVA-OneVision and Qwen2-VL.
  • Figure 4: The pipeline of dataset construction.
  • Figure 5: Effect of prompts specifically designed for different types of manipulation techniques.
  • ...and 4 more figures