Latent Reconstruction from Generated Data for Multimodal Misinformation Detection
Stefanos-Iordanis Papadopoulos, Christos Koutlis, Symeon Papadopoulos, Panagiotis C. Petrantonakis
TL;DR
This work tackles multimodal misinformation detection (MMD) under data scarcity by introducing MisCaption This!, a framework that uses Vision-Language Models to generate realistic miscaptioned image captions, and Latent Multimodal Reconstruction (LAMAR), a Transformer-based network that reconstructs the embedding of truthful captions from manipulated inputs. It systematically compares end-to-end and pre-training strategies, and four integration mechanisms (direct, mask, gate, attention), demonstrating that VLM-generated data markedly improves real-world generalization. LAMAR achieves new state-of-the-art on VERITE and NewsCLIPpings, with strong temporal generalization on VERITE 24/25, outperforming NES- and cross-modal-based baselines by notable margins. The work also discusses ethical considerations and highlights the potential and risks of using generated data for misinformation research, calling for responsible data sharing and future integration of external knowledge sources for comprehensive fact-checking.
Abstract
Multimodal misinformation, such as miscaptioned images, where captions misrepresent an image's origin, context, or meaning, poses a growing challenge in the digital age. Due to the scarcity of large-scale annotated datasets for multimodal misinformation detection (MMD), recent approaches rely on synthetic training data created via out-of-context pairings or named entity manipulations (e.g., altering names, dates, or locations). However, these often yield simplistic, unrealistic examples, which limits their utility as training examples. To address this, we introduce "MisCaption This!", a framework for generating high-fidelity synthetic miscaptioned datasets through Adversarial Prompting of Vision-Language Models (VLMs). Additionally, we introduce "Latent Multimodal Reconstruction" (LAMAR), a Transformer-based network trained to reconstruct the embeddings of truthful captions, providing a strong auxiliary signal to guide detection. We explore various training strategies (end-to-end vs. large-scale pre-training) and integration mechanisms (direct, mask, gate, and attention). Extensive experiments show that models trained on "MisCaption This!" data generalize better to real-world misinformation, while LAMAR achieves new state-of-the-art on NewsCLIPpings, VERITE, and the newly introduced VERITE 24/25 benchmark; highlighting the efficacy of VLM-generated data and reconstruction-based networks for advancing MMD. Our code is available at https://github.com/stevejpapad/miscaptioned-image-reconstruction
