Zero-Shot Warning Generation for Misinformative Multimodal Content
Giovanni Pio Delvecchio, Huy Hong Nguyen, Isao Echizen
TL;DR
This work addresses multimodal misinformation, particularly out-of-context image-caption pairs, by introducing cross-modality consistency checks and a zero-shot warning-generation pathway. It presents a full-scale model and a lightweight, parameter-efficient variant, both of which draw on external evidence and a frozen visual-language model for cross-modal reasoning; warnings are generated through prompt-based zero-shot learning, with no fine-tuning of the generator. The architecture combines visual, textual, page-based, and multimodal reasoning with the warning generator, and is validated on NewsCLIPpings, where it reaches competitive accuracy with short training times. Human evaluation indicates that the warnings are promising in informativeness and quality, though noisy or conflicting evidence and incomplete alignment between the input and its context remain challenges.
Abstract
The prevalence of misinformation poses significant societal concerns. Out-of-context misinformation, where authentic images are paired with false text, is particularly deceptive and easily misleads audiences. Most existing detection methods evaluate image-text consistency but often lack sufficient explanations, which are essential for effectively debunking misinformation. We present a model that detects multimodal misinformation through cross-modality consistency checks and requires minimal training time. We also propose a lightweight model that achieves competitive performance using only one-third of the parameters. Finally, we introduce a dual-purpose zero-shot learning task for generating contextualized warnings, enabling automated debunking and enhancing user comprehension. Qualitative and human evaluations of the generated warnings highlight both the potential and the limitations of our approach.
