Evidence-Grounded Multimodal Misinformation Detection with Attention-Based GNNs

Sharad Duwal, Mir Nafis Sharear Shopnil, Abhishek Tyagi, Adiba Mahbub Proma

TL;DR

This work tackles multimodal out-of-context misinformation by contextualizing images with external textual evidence before reasoning about veracity. It introduces EGMMG, a graph-grounded detector that builds an evidence graph from online sources and a claim graph from captions, and uses cross-graph attention over a GNN to produce a veracity score. By separating grounding from reasoning and emphasizing external context, the method reduces reliance on large language models, mitigating hallucinations and enabling a smaller, task-specific model. Across Factify and related datasets, EGMMG outperforms frontier LLMs given the same contextual information, achieving high accuracy and demonstrating strong generalization and efficiency with a modest parameter count (~10.7M). This approach offers a scalable, interpretable alternative for multimodal misinformation detection with practical impact for fact-checking workflows and deployed detection systems.

Abstract

Multimodal out-of-context (OOC) misinformation repurposes real images with unrelated or misleading captions. Detecting such misinformation is challenging because it requires resolving the context of the claim before checking for misinformation. Many current methods, including LLMs and LVLMs, do not perform this contextualization step, and LLMs hallucinate in the absence of context or parametric knowledge. In this work, we propose a graph-based method that evaluates the consistency between the image and the caption by constructing two graph representations: an evidence graph, derived from online textual evidence, and a claim graph, derived from the claim in the caption. Using graph neural networks (GNNs) to encode and compare these representations, our framework then evaluates the truthfulness of image-caption pairs. We create datasets for our graph-based method, then evaluate our baseline model and compare it against popular LLMs on the misinformation detection task. Our method scores $93.05\%$ detection accuracy on the evaluation set and outperforms the second-best performing method (an LLM) by $2.82\%$, making a case for smaller, task-specific methods.
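
To make the encode-and-compare idea concrete, the following is a minimal sketch of cross-graph attention between a claim graph and an evidence graph. It is not the EGMMG classifier: it has no learned parameters and no message passing, and the function names (`cross_graph_attention`, `consistency_score`), the toy two-dimensional node embeddings, and the cosine-based score are all assumptions made for illustration. Each claim node attends over the evidence nodes, and the score measures how well the attended evidence matches the claim.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    # Subtract the max for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cosine(a, b):
    na, nb = math.sqrt(dot(a, a)), math.sqrt(dot(b, b))
    return dot(a, b) / (na * nb) if na and nb else 0.0

def cross_graph_attention(claim_nodes, evidence_nodes):
    """For each claim node, return an attention-weighted mix of evidence nodes.

    Hypothetical, parameter-free scaled dot-product attention; a trained
    model would use learned query/key/value projections instead.
    """
    scale = math.sqrt(len(evidence_nodes[0]))
    attended = []
    for query in claim_nodes:
        weights = softmax([dot(query, key) / scale for key in evidence_nodes])
        mixed = [
            sum(w * key[d] for w, key in zip(weights, evidence_nodes))
            for d in range(len(query))
        ]
        attended.append(mixed)
    return attended

def consistency_score(claim_nodes, evidence_nodes):
    """Mean cosine similarity between claim nodes and their attended evidence."""
    attended = cross_graph_attention(claim_nodes, evidence_nodes)
    sims = [cosine(q, a) for q, a in zip(claim_nodes, attended)]
    return sum(sims) / len(sims)

# Toy embeddings (assumed): two evidence nodes, one claim node per case.
evidence = [[4.0, 0.0], [0.0, 4.0]]
supported_claim = [[4.0, 0.0]]     # aligns with the first evidence node
unsupported_claim = [[-4.0, 0.0]]  # points away from all evidence

supported = consistency_score(supported_claim, evidence)
unsupported = consistency_score(unsupported_claim, evidence)
```

In this toy setup the supported claim scores close to 1 and the unsupported claim scores close to 0. A real detector would replace the fixed embeddings with GNN-encoded node features and the cosine score with a trained classification head.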

Paper Structure

This paper contains 25 sections, 7 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: The EGMMG pipeline. For an image-claim sample, the pipeline prepares two graphs, evidence graph and claim graph, using online evidence retrieval followed by a rule-based analysis of subject-object relations in the evidence documents. Once we have the two graphs, we use a graph attention-based classifier to detect misinformation.
  • Figure 3: The EGMMG classifier.
  • Figure 4: Evidence graph generated by EGMMG for the example in Figure \ref{fig:gandhi-voting-comprehensive}.
  • Figure 5: Prompt used to evaluate misinformation detection performance of LLMs (Sonnet, Haiku, GPT). For the EVAL_SUFFICIENT set, we allow one more option: "not enough information".
  • Figure : (a) Image
  • ...and 2 more figures