Multimodal Misinformation Detection using Large Vision-Language Models

Sahar Tahmasebi; Eric Müller-Budack; Ralph Ewerth

TL;DR

The paper tackles multimodal misinformation detection in a zero-shot setting by combining an evidence retrieval stage with a multimodal fact-verification stage. It introduces LVLM4EV, a re-ranking approach that uses LLMs and LVLMs to refine the retrieved text and image evidence, followed by LVLM4FV, which prompts an LVLM to produce a verdict from the retrieved multimodal evidence. In experiments on MOCHEG and Factify, the authors report better evidence retrieval and fact-verification performance than a supervised baseline, as well as better generalization across datasets. The work points to unsupervised, LVLM-based pipelines as a scalable option for cross-domain misinformation detection and leaves interpretability and richer annotations to future work.

Abstract

The increasing proliferation of misinformation and its alarming impact have motivated both industry and academia to develop approaches for misinformation detection and fact checking. Recent advances in large language models (LLMs) have shown remarkable performance in various tasks, but whether and how LLMs could help with misinformation detection remains relatively underexplored. Most existing state-of-the-art approaches either do not consider evidence and focus solely on claim-related features, or assume the evidence to be provided. Few approaches consider evidence retrieval as part of misinformation detection, and those that do rely on fine-tuning models. In this paper, we investigate the potential of LLMs for misinformation detection in a zero-shot setting. We incorporate an evidence retrieval component into the process, as it is crucial to gather pertinent information from various sources to determine the veracity of claims. To this end, we propose a novel re-ranking approach for multimodal evidence retrieval using both LLMs and large vision-language models (LVLMs). The retrieved evidence samples (images and texts) serve as the input for an LVLM-based approach for multimodal fact verification (LVLM4FV). To enable a fair evaluation, we address the issue of incomplete ground truth for evidence samples in an existing evidence retrieval dataset by annotating a more complete set of evidence samples for both image and text retrieval. Our experimental results on two datasets demonstrate the superiority of the proposed approach in both evidence retrieval and fact verification, as well as better generalization across datasets compared to the supervised baseline.
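
The retrieve-then-re-rank idea described in the abstract can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the paper: `Evidence`, `rerank`, `toy_relevance`, and the `top_n` parameter are illustrative, and the word-overlap scorer is only a stand-in for an LLM or LVLM relevance judgement. How the paper actually combines initial ranking scores with generative model scores is not specified on this page, so the sketch simply re-orders the shortlist by the model score and uses the initial score as a tie-break.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Evidence:
    content: str          # a text passage or an image identifier (hypothetical)
    initial_score: float  # score assigned by the initial retriever


def rerank(
    claim: str,
    candidates: List[Evidence],
    relevance_fn: Callable[[str, Evidence], float],
    top_n: int = 10,
) -> List[Evidence]:
    """Keep the top-N candidates by initial score, then re-order them."""
    shortlist = sorted(
        candidates, key=lambda e: e.initial_score, reverse=True
    )[:top_n]
    # Assumption: model relevance decides the order; the initial score
    # only breaks ties. The paper's actual scoring may differ.
    return sorted(
        shortlist,
        key=lambda e: (relevance_fn(claim, e), e.initial_score),
        reverse=True,
    )


def toy_relevance(claim: str, evidence: Evidence) -> float:
    """Stand-in for an LLM/LVLM judgement: fraction of claim words found."""
    claim_words = set(claim.lower().split())
    evidence_words = set(evidence.content.lower().split())
    return len(claim_words & evidence_words) / max(len(claim_words), 1)


candidates = [
    Evidence("the weather is sunny today", 0.9),
    Evidence("vaccine trial shows strong results", 0.5),
    Evidence("unrelated sports news", 0.1),
]
ranked = rerank("vaccine trial results", candidates, toy_relevance, top_n=2)
```

In a real pipeline, `relevance_fn` would wrap a prompted model call, and the re-ranked list would then be passed to the fact-verification stage.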

Paper Structure (32 sections, 1 equation, 5 figures, 5 tables)

Introduction
Related Work
Misinformation Detection
Generative AI Models
Multimodal Misinformation Detection
Problem Definition
Evidence Retrieval
Initial Retriever
Text Retrieval
Image Retrieval
Prompting
Re-ranking using Generative AI Models
Initial Ranking Scores (IRS)
Generative AI Scores (GAIS)
Fact Verification
...and 17 more sections

Figures (5)

Figure 1: Example of multimodal misinformation detection.
Figure 2: Overview of our misinformation detection approach. Blue border (LVLM4EV): Based on a textual input claim, evidence texts and images are initially retrieved from a corpus using a state-of-the-art approach (e.g., MOCHEG). Generative LLMs (e.g., Mistral-7B) and LVLMs (e.g., InstructBLIP) are used to re-rank the top-N text and image evidence. Green border (LVLM4FV): Based on the re-ranked evidence, we finally employ an LVLM (e.g., LLaVA) for misinformation detection.
Figure 3: Qualitative example for the evidence retrieval component, which also shows the incomplete ground truth issue in the MOCHEG dataset. Green borders mark the labeled ground truth, blue borders mark unlabeled evidence with content similar to the ground truth, and red borders mark irrelevant content.
Figure 4: An example of data annotation at entity level and evidence level for image retrieval (top) and text retrieval (bottom).
Figure 5: Mean average precision (mAP) of LVLM4EV using different re-ranking strategies (see the section on re-ranking) for text retrieval (top) and image retrieval (bottom).
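
For readers unfamiliar with the metric in Figure 5, the following is a minimal sketch of the standard definition of mean average precision. The function names are illustrative, and the exact variant used in the paper (for example, any rank cutoff) is not stated on this page, so this should be read as the textbook form rather than the authors' evaluation code.

```python
from typing import Hashable, List, Sequence, Set


def average_precision(ranked: Sequence[Hashable], relevant: Set[Hashable]) -> float:
    """Average of precision@k over the ranks k where a relevant item appears."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    # Normalizing by the number of relevant items penalizes missed evidence.
    return precision_sum / len(relevant)


def mean_average_precision(
    rankings: List[Sequence[Hashable]], relevants: List[Set[Hashable]]
) -> float:
    """Mean of the per-query average precision values."""
    aps = [average_precision(r, rel) for r, rel in zip(rankings, relevants)]
    return sum(aps) / len(aps) if aps else 0.0


# Two toy queries with hand-labeled relevant items.
rankings = [["a", "b", "c"], ["x", "a"]]
relevants = [{"a", "c"}, {"a"}]
map_score = mean_average_precision(rankings, relevants)
```

Under this definition, a retrieved item that is relevant but missing from the labeled set counts as a miss, which is why incomplete ground truth can understate retrieval quality.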
