RED-DOT: Multimodal Fact-checking via Relevant Evidence Detection

Stefanos-Iordanis Papadopoulos, Christos Koutlis, Symeon Papadopoulos, Panagiotis C. Petrantonakis

TL;DR

RED-DOT tackles multimodal misinformation by introducing Relevant Evidence Detection (RED), which filters external evidence before it is used to predict a verdict. The approach combines evidence retrieval and re-ranking, modality fusion, and a RED module within a shared Transformer framework, trained with the multitask objective $L = L^v + L^e$ (a verdict loss plus an evidence-relevance loss). Key findings: models trained on NewsCLIPings+ generalize to VERITE under out-of-distribution evaluation (OOD-CV); re-ranking the evidence and keeping a single piece per modality is often optimal; and explicit element-wise modality fusion improves accuracy without requiring multiple backbone encoders or large amounts of evidence. RED-DOT improves over the state of the art (SotA) by up to 33.7% on VERITE and by up to 3% on NewsCLIPings+, and the code is released to support reproducibility and further research on relevance-aware evidence in multimodal fact-checking.
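
The multitask objective can be illustrated with a minimal sketch. It assumes that both the verdict loss $L^v$ and the evidence-relevance loss $L^e$ are binary cross-entropy terms and that $L^e$ is averaged over the evidence items; the summary above only states that the two losses are summed, so these choices and the function names are hypothetical.

```python
import math


def binary_cross_entropy(prob, label, eps=1e-12):
    """Binary cross-entropy for one predicted probability and a 0/1 label."""
    prob = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def multitask_loss(verdict_prob, verdict_label, evidence_probs, evidence_labels):
    """Return (L, L^v, L^e) with L = L^v + L^e.

    Assumption: L^e is the mean BCE over the evidence items; the source
    does not specify the reduction.
    """
    loss_v = binary_cross_entropy(verdict_prob, verdict_label)
    per_item = [
        binary_cross_entropy(p, y) for p, y in zip(evidence_probs, evidence_labels)
    ]
    loss_e = sum(per_item) / len(per_item)
    return loss_v + loss_e, loss_v, loss_e


# Illustrative values only: four evidence items labelled [0, 0, 1, 1]
# (two irrelevant, two relevant), with made-up predicted probabilities.
total, loss_v, loss_e = multitask_loss(
    verdict_prob=0.8,
    verdict_label=1,
    evidence_probs=[0.1, 0.3, 0.7, 0.9],
    evidence_labels=[0, 0, 1, 1],
)
print(f"L = {total:.4f} (L^v = {loss_v:.4f}, L^e = {loss_e:.4f})")
```

Summing the two terms with equal weight means that a model cannot lower the total loss by predicting the verdict well while ignoring which evidence items are relevant.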

Abstract

Online misinformation is often multimodal in nature, i.e., it is caused by misleading associations between texts and accompanying images. To support the fact-checking process, researchers have recently been developing automatic multimodal methods that gather and analyze external information (evidence) related to the image-text pairs under examination. However, prior works assumed all external information collected from the web to be relevant. In this study, we introduce a "Relevant Evidence Detection" (RED) module to discern whether each piece of evidence is relevant, i.e., whether it supports or refutes the claim. Specifically, we develop the "Relevant Evidence Detection Directed Transformer" (RED-DOT) and explore multiple architectural variants (e.g., single or dual-stage) and mechanisms (e.g., "guided attention"). Extensive ablation and comparative experiments demonstrate that RED-DOT achieves significant improvements over the state-of-the-art (SotA) on the VERITE benchmark by up to 33.7%. Furthermore, our evidence re-ranking and element-wise modality fusion led to RED-DOT surpassing the SotA on NewsCLIPings+ by up to 3% without the need for numerous pieces of evidence or multiple backbone encoders. We release our code at: https://github.com/stevejpapad/relevant-evidence-detection
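
As a rough illustration of what element-wise modality fusion can look like, the sketch below combines an image embedding and a text embedding of equal size into a short sequence of tokens. The specific operations used here (sum, difference, product, kept alongside the two original embeddings) are an assumption made for illustration; the abstract only states that the fusion is element-wise. The function name `fuse_modalities` is hypothetical.

```python
def fuse_modalities(image_emb, text_emb):
    """Build a token sequence from two same-length embeddings.

    Assumption: the fused tokens are the element-wise sum, difference and
    product, listed after the two original embeddings.
    """
    if len(image_emb) != len(text_emb):
        raise ValueError("image and text embeddings must have the same length")
    added = [i + t for i, t in zip(image_emb, text_emb)]
    subtracted = [i - t for i, t in zip(image_emb, text_emb)]
    multiplied = [i * t for i, t in zip(image_emb, text_emb)]
    # Each inner list is one token that a Transformer could attend over.
    return [list(image_emb), list(text_emb), added, subtracted, multiplied]


# Toy 2-dimensional embeddings; real encoder outputs are much larger.
tokens = fuse_modalities([1.0, 2.0], [3.0, 4.0])
for token in tokens:
    print(token)
```

Because every fused token is derived from the same pair of embeddings, a scheme of this kind needs only one encoder that places images and texts in a shared space, which is consistent with the abstract's point about avoiding multiple backbone encoders.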

Paper Structure

This paper contains 20 sections, 11 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Image-text pair under verification with external information (evidence), both images and texts, collected from the web. The proposed framework retrieves and re-ranks the evidence while RED-DOT determines which pieces of information are most relevant to (support or refute) the image-text pair and then uses those to determine the pair's veracity.
  • Figure 2: Visualization of the "irrelevant" evidence ranking equation (label: eqn:irrelevant_rank). Hard negative sampling for retrieving "irrelevant" evidence.
  • Figure 3: (a) Overview of the proposed Transformer $D(\cdot)$ in RED-DOT, employing "Modality Fusion". (b) High-level overview of the single- and dual-stage RED-DOT variants. Dotted lines represent the second stage in DSL.
  • Figure 4: Inference by RED-DOT variants: DSL and DSL+GA (w/ CLIP ViT B/32) on samples taken from NewsCLIPings+. We report the Attention scores of each method. "Relevance Ground truth" is set to [0, 0, 1, 1] for simplicity, regarding [$T^e-$, $I^e-$, $T^e+$, $I^e+$], respectively.