Zero-Shot Warning Generation for Misinformative Multimodal Content

Giovanni Pio Delvecchio, Huy Hong Nguyen, Isao Echizen

TL;DR

This work addresses misinformation, particularly out-of-context image-caption pairs, by introducing cross-modality consistency checks and a zero-shot warning-generation pathway. It presents a lightweight, parameter-efficient model and a full-scale variant that leverage external evidence and a frozen visual-language model to perform cross-modal reasoning and generate contextual warnings without fine-tuning. The architecture combines visual, textual, page-based, and multimodal reasoning with a warning generator guided by prompt-based zero-shot learning, validated on NewsCLIPpings with competitive accuracy and favorable training times. Human evaluation shows promising informativeness and quality of warnings, though challenges remain from noisy or conflicting evidence and incomplete input-context alignment.

Abstract

The widespread prevalence of misinformation poses significant societal concerns. Out-of-context misinformation, where authentic images are paired with false text, is particularly deceptive and easily misleads audiences. Most existing detection methods primarily evaluate image-text consistency but often lack sufficient explanations, which are essential for effectively debunking misinformation. We present a model that detects multimodal misinformation through cross-modality consistency checks, requiring minimal training time. Additionally, we propose a lightweight model that achieves competitive performance using only one-third of the parameters. We also introduce a dual-purpose zero-shot learning task for generating contextualized warnings, enabling automated debunking and enhancing user comprehension. Qualitative and human evaluations of the generated warnings highlight both the potential and limitations of our approach.

Paper Structure

This paper contains 24 sections, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Proposed pipeline for debunking misleading content. Each input image $I^{q}$ is used as a query to retrieve text $C^{e}$ from the web via inverse search, while each input caption $C^{q}$ retrieves images $I^{e}$ through direct search. Labels $L^{q}$ and $L^{e}$ are extracted using the Google Cloud Vision API. Consistency checks compare the input data with evidence of the same type; the input image-caption consistency check is shown in green. Each consistency check produces a consistency score, and $S_{\text{pred}}$ is the vector containing all of these scores. The source pages $P^{e}$ of each piece of evidence are ranked to identify the most relevant pages $P_{\text{attn}}$. A frozen VLM with a custom prompt containing $I^{q}$, $C^{q}$, $S_{\text{pred}}$ and $P_{\text{attn}}$ generates an explanation contextualizing the input.
  • Figure 2: Overview of the proposed architecture for multimodal misinformation detection. Each consistency block outputs a score in the range $[-1, 1]$, since it is computed as a cosine similarity. The only exception is the score $S_{\text{logit}}$, which is the output of the multimodal consistency block (highlighted in green). The scores are concatenated, and the resulting vector is passed to the classification head, which consists of batch normalization (Ioffe and Szegedy, 2015) and a linear layer. $P_{\text{attn}}$ are the source pages corresponding to the evidence with the highest attention score from each attention-based block ($I^{e}_{\text{first}}$, $L^{e}_{\text{first}}$, $C^{e}_{\text{first}}$, $P^{e}_{\text{first}}$). The content of $P_{\text{attn}}$, together with the input pair and the final score $P_{\text{class}}$, is used to construct the input prompt of MiniGPT-4 for warning generation.
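
The score-and-classify step described in the Figure 2 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy embeddings, and the weights are all hypothetical, each consistency block is reduced to a plain cosine similarity between one query embedding and one evidence embedding, and the batch normalization here uses batch statistics only, without the learned scale and shift a trained model would normally have.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; the result lies in [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: treat a zero vector as "no evidence of similarity"
    return dot / (norm_u * norm_v)


def score_vector(embedding_pairs, s_logit):
    """Concatenate one cosine score per (query, evidence) pair with the
    multimodal score s_logit, which is not a cosine similarity."""
    scores = [cosine_similarity(q, e) for q, e in embedding_pairs]
    return scores + [s_logit]


def batch_norm(batch, eps=1e-5):
    """Normalize every feature across the batch (no learned scale or shift)."""
    n = len(batch)
    dims = len(batch[0])
    means = [sum(row[j] for row in batch) / n for j in range(dims)]
    variances = [
        sum((row[j] - means[j]) ** 2 for row in batch) / n for j in range(dims)
    ]
    return [
        [(row[j] - means[j]) / math.sqrt(variances[j] + eps) for j in range(dims)]
        for row in batch
    ]


def linear(x, weights, bias):
    """Single-output linear layer: weighted sum of the inputs plus a bias."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def classify(batch_of_scores, weights, bias):
    """Classification head: batch normalization followed by a linear layer."""
    return [linear(row, weights, bias) for row in batch_norm(batch_of_scores)]


if __name__ == "__main__":
    # Two toy samples, each with two consistency blocks plus a multimodal score.
    sample_a = score_vector([([1.0, 0.0], [0.9, 0.1]), ([0.2, 0.8], [0.1, 0.9])], 0.7)
    sample_b = score_vector([([1.0, 0.0], [-0.8, 0.3]), ([0.2, 0.8], [0.9, -0.1])], -0.4)
    # Hypothetical weights; a real model would learn these during training.
    print(classify([sample_a, sample_b], weights=[0.5, 0.5, 1.0], bias=0.0))
```

Keeping the head this small is consistent with the short training times the abstract reports, since only the scores, not the raw embeddings, reach the trainable layer in this sketch.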