Multimodal Large Language Models to Support Real-World Fact-Checking

Jiahui Geng; Yova Kementchedjhieva; Preslav Nakov; Iryna Gurevych

TL;DR

The paper introduces an evidence-free framework to evaluate multimodal large language models (MLLMs) for real-world fact-checking by eliciting predictions, explanations, and confidence. It systematically tests GPT-4V, LLaVA, and MiniGPT-v2 across Fauxtography, COSMOS, MOCHEG, and Post-4V datasets, using prompt ensembles (PE) and in-context learning (ICL) to probe accuracy, calibration, and reasoning. Key findings show GPT-4V achieving around 80% overall accuracy with well-calibrated confidence, while open-source models exhibit biases and prompt sensitivity; PE and especially ICL improve performance in several settings. The work highlights the potential and limitations of current MLLMs for combatting multimodal misinformation and points to directions like knowledge distillation and external knowledge integration to build more robust, trustworthy fact-checking tools.

Abstract

Multimodal large language models (MLLMs) carry the potential to support humans in processing vast amounts of information. While MLLMs are already being used as a fact-checking tool, their abilities and limitations in this regard are understudied. Here we aim to bridge this gap. In particular, we propose a framework for systematically assessing the capacity of current multimodal models to facilitate real-world fact-checking. Our methodology is evidence-free, leveraging only these models' intrinsic knowledge and reasoning capabilities. By designing prompts that extract models' predictions, explanations, and confidence levels, we delve into research questions concerning model accuracy, robustness, and reasons for failure. We empirically find that (1) GPT-4V exhibits superior performance in identifying malicious and misleading multimodal claims, with the ability to explain the unreasonable aspects and underlying motives, and (2) existing open-source models exhibit strong biases and are highly sensitive to the prompt. Our study offers insights into combating false multimodal information and building secure, trustworthy multimodal models. To the best of our knowledge, we are the first to evaluate MLLMs for real-world fact-checking.
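
The summary above refers to prompt ensembles and to how well a model's stated confidence matches its accuracy. As a rough illustration only, the sketch below shows one way such responses could be aggregated and scored. It assumes majority voting across prompt variants and a binned expected calibration error. Neither is confirmed here as the paper's exact procedure, and the function names `majority_vote` and `expected_calibration_error` are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the most frequent label among the predictions from several prompts.

    Assumption: the ensemble is combined by simple majority voting; the paper's
    actual aggregation rule is not stated in this summary.
    """
    return Counter(predictions).most_common(1)[0][0]


def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned expected calibration error (ECE).

    confidences: floats in [0, 1], the model's stated confidence per claim.
    correct: booleans, whether each prediction matched the gold label.
    """
    total = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 falls into the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(conf for conf, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        # Weight each bin's confidence/accuracy gap by its share of the samples.
        ece += (len(items) / total) * abs(avg_conf - accuracy)
    return ece


if __name__ == "__main__":
    # Hypothetical labels returned by three prompt variants for one claim.
    print(majority_vote(["false", "true", "false"]))
    # Hypothetical confidences and correctness flags for four claims.
    print(expected_calibration_error([0.9, 0.9, 0.6, 0.6], [True, True, True, False]))
```

A lower calibration error means the stated confidence tracks accuracy more closely, which is the sense in which a model can be described as well calibrated.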

Paper Structure (33 sections, 10 figures, 6 tables)

Introduction
Related Work
LLMs for Text-Only Fact-Checking
Multimodal Fact-Checking
Evaluation Framework
Datasets
Fauxtography
COSMOS
MOCHEG
Post-4V
Evaluation Prompt
Evaluation Metrics
Response Types
Accuracy Metrics
Experimental Setups
...and 18 more sections

Figures (10)

Figure 1: Illustration of our proposed framework to evaluate the capability of MLLMs as fact-checkers. Initially, we collect their responses to multimodal claims, encompassing predictions, explanations, and confidence levels. We then assess their performance across dimensions, including accuracy, bias, and their failure reasons.
Figure 2: Prompts obtained from ChatGPT that are used in prompt ensembles experiments.
Figure 3: The left graph illustrates the confidence score distribution of GPT-4V and LLaVA(13b), and the right graph presents their calibration curves. FAU: Fauxtography, COS: COSMOS, MOC: MOCHEG, POST: Post-4V.
Figure 4: Sampled fact-checking responses from different models and approaches. The first row shows the claim source and its veracity. The second row includes multimodal claims, and the subsequent four rows feature responses from GPT-4V, LLaVA(13b), LLaVA(13b) without image input, and LLaVA+ICL-1 (using the first demonstration), respectively. Purple text indicates hallucinations by the model when no images are present; red text shows outdated knowledge, and green text displays the model's analysis of image manipulation.
Figure 5: Average number of sentences in explanations across different models and settings. GPT-4V generates the longest explanations except on Post-4V. With one example, ICL-1 significantly increases the average explanation length.
...and 5 more figures
