Latent Reconstruction from Generated Data for Multimodal Misinformation Detection

Stefanos-Iordanis Papadopoulos, Christos Koutlis, Symeon Papadopoulos, Panagiotis C. Petrantonakis

TL;DR

This work tackles multimodal misinformation detection (MMD) under data scarcity with two contributions: "MisCaption This!", a framework that uses Vision-Language Models (VLMs) to generate realistic false captions for images, and Latent Multimodal Reconstruction (LAMAR), a Transformer-based network that reconstructs the embedding of the truthful caption from a possibly manipulated input. It compares end-to-end training against large-scale pre-training, as well as four integration mechanisms (direct, mask, gate, attention), and shows that training on VLM-generated data improves generalization to real-world misinformation. LAMAR sets a new state of the art on VERITE and NewsCLIPpings and generalizes well over time on VERITE 24/25, outperforming baselines trained on named entity swaps and on cross-modal (out-of-context) pairings. The work also discusses ethical considerations, weighs the potential and risks of using generated data for misinformation research, and calls for responsible data sharing and the future integration of external knowledge sources for comprehensive fact-checking.

Abstract

Multimodal misinformation, such as miscaptioned images, where captions misrepresent an image's origin, context, or meaning, poses a growing challenge in the digital age. Due to the scarcity of large-scale annotated datasets for multimodal misinformation detection (MMD), recent approaches rely on synthetic training data created via out-of-context pairings or named entity manipulations (e.g., altering names, dates, or locations). However, these often yield simplistic, unrealistic examples, which limits their utility for training. To address this, we introduce "MisCaption This!", a framework for generating high-fidelity synthetic miscaptioned datasets through Adversarial Prompting of Vision-Language Models (VLMs). Additionally, we introduce "Latent Multimodal Reconstruction" (LAMAR), a Transformer-based network trained to reconstruct the embeddings of truthful captions, providing a strong auxiliary signal to guide detection. We explore various training strategies (end-to-end vs. large-scale pre-training) and integration mechanisms (direct, mask, gate, and attention). Extensive experiments show that models trained on "MisCaption This!" data generalize better to real-world misinformation, while LAMAR achieves a new state of the art on NewsCLIPpings, VERITE, and the newly introduced VERITE 24/25 benchmark, highlighting the efficacy of VLM-generated data and reconstruction-based networks for advancing MMD. Our code is available at https://github.com/stevejpapad/miscaptioned-image-reconstruction
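
The adversarial prompting idea can be illustrated with a small selection routine: a generation prompt is kept only if a zero-shot detector struggles to separate its outputs from the truthful captions. The following is a minimal sketch under stated assumptions, not the authors' implementation. The function names, the text-only `detector` callable (the real detector is a VLM that also sees the image), and the `max_accuracy` cut-off are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple


def detector_accuracy(
    pairs: List[Tuple[str, str]],
    detector: Callable[[str], bool],
) -> float:
    """Accuracy on a balanced set of (truthful, generated) caption pairs.

    `detector(caption)` is a hypothetical stand-in that returns True when
    the caption is predicted to be false.
    """
    correct = 0
    for truthful, generated in pairs:
        correct += int(not detector(truthful))  # truthful should pass
        correct += int(detector(generated))     # generated should be flagged
    return correct / (2 * len(pairs))


def select_prompts(
    accuracy_by_prompt: Dict[str, float],
    max_accuracy: float,
) -> List[str]:
    """Keep prompts whose generated captions are hard to detect.

    High detector accuracy means the generated misinformation is too
    simplistic, so such prompts are dropped. The threshold is an assumption.
    Results are ordered from hardest to easiest.
    """
    kept = [p for p, acc in accuracy_by_prompt.items() if acc <= max_accuracy]
    return sorted(kept, key=lambda p: accuracy_by_prompt[p])


# Toy usage: the detector only catches captions containing the word "FAKE".
toy_pairs = [("a", "a FAKE"), ("b", "b subtle")]
toy_acc = detector_accuracy(toy_pairs, lambda caption: "FAKE" in caption)
chosen = select_prompts({"p0": 0.95, "p1": 0.60, "p2": 0.70}, max_accuracy=0.75)
```

In this toy setup the detector misses one of the four captions, and the prompt with 0.95 detection accuracy is discarded as too easy.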

Paper Structure

This paper contains 32 sections, 15 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: High-level overview of the proposed workflow. A VLM generates a false caption from the original, truthful image-caption pair. The Reconstructor (a Transformer encoder) then takes both the image and false caption embeddings as inputs to identify and rectify inaccuracies in the text, recreating the original truthful caption embedding. This reconstructed representation is fused with the other modalities via a specialized mechanism (e.g., Gating or Attention) and passed to the detector to produce the final verdict. The network is trained to jointly minimize the error between the original and reconstructed embeddings and the classification loss.
  • Figure 2: Adversarial Prompt Selection: A VLM 'Manipulator' generates a false caption using the generative prompt $p_0^{gen}$. The VLM 'Detector' is then evaluated on the zero-shot classification of both the truthful and generated captions. Dashed lines indicate the prediction on a single sample, while the overall accuracy is calculated across a balanced set of 2,000 samples. In this specific case, the high overall detection accuracy indicates that the generated misinformation is too simplistic; thus $p_0^{gen}$ is not selected for the creation of a training dataset.
  • Figure 3: Examples of truthful and generated captions, alongside false captions created via named entity swaps.
  • Figure 4: End-to-end training of the proposed LAMAR architecture. The input caption $C$ may be either truthful or falsified. A CLIP ViT-L/14 encoder and a Transformer-based reconstruction module, enhanced with element-wise vector operations for modality fusion, output the reconstructed embedding $\mathbf{\hat{C}^t}$. This embedding is integrated into the detection network via an integration mechanism (e.g., gate, mask, or attention), and the detection network outputs the final verdict. The reconstruction module is trained with MSE loss against the ground-truth $\mathbf{C}^t$, while the detection module is trained with cross-entropy (CE).
  • Figure 5: Performance of detection models (DT-Transformer and RED-DOT) trained on four variations of LLaVA-$\mathcal{D}_1$, $\mathcal{D}_2$, $\mathcal{D}_3$, and $\mathcal{D}_4$, evaluated under varying filtering thresholds ($l \in \{0,5,10,15,25,50,\text{None}\}$, corresponding to 4.5%, 19.1%, 27.8%, 34.9%, 47.9%, 74.0%, and 100% of the dataset, respectively) in terms of Test-set Accuracy and VERITE "True vs. MC" Accuracy.
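
The joint objective described in the captions of Figures 1 and 4 (MSE on the reconstructed caption embedding plus cross-entropy on the verdict) can be sketched in a few lines. This is an illustrative sketch only: the binary form of the cross-entropy, the single-logit detector output, and the `recon_weight` balancing term are assumptions that the captions do not specify, and plain Python lists stand in for embedding tensors.

```python
import math
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between reconstructed and truthful embeddings."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def binary_cross_entropy(logit: float, label: int) -> float:
    """Numerically stable binary CE on a raw logit (label 1 = falsified)."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def joint_loss(
    reconstructed: Sequence[float],
    truthful: Sequence[float],
    logit: float,
    label: int,
    recon_weight: float = 1.0,  # assumed weighting; not given in the source
) -> float:
    """Detection loss plus weighted reconstruction loss for one sample."""
    detection = binary_cross_entropy(logit, label)
    reconstruction = mse(reconstructed, truthful)
    return detection + recon_weight * reconstruction


# Toy usage with a 2-dimensional "embedding" and an undecided detector.
loss = joint_loss([1.0, 2.0], [1.0, 4.0], logit=0.0, label=1)
```

Because both terms are summed into one scalar, gradients from the detection loss and the reconstruction loss would reach the shared reconstruction module in a single backward pass, which is what "end-to-end" training implies here.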