MultiCheck: Strengthening Web Trust with Unified Multimodal Fact Verification

Aditya Kishore, Gaurav Kumar, Jasabanta Patro

TL;DR

Multimodal misinformation increasingly blends text, images, and OCR content, challenging traditional unimodal fact-checkers. The authors introduce MultiCheck, a lightweight, end-to-end framework that jointly reasons over claim text, images, and OCR signals using a relational fusion module based on element-wise difference and product, coupled with a contrastive InfoNCE objective to align semantically related claim–document pairs. Training combines cross-entropy with a contrastive loss (λ = 0.1), yielding strong cross-modal representations without heavy generative decoding. Empirically, MultiCheck achieves large macro-F1 gains on Factify-2 and Mocheg compared with strong baselines, remains robust under OCR noise and modality imbalance, and supports memory-efficient deployment via 4-bit quantization and QLoRA while preserving performance. This work offers practical, transparent multimodal verification suitable for journalists and web integrity efforts seeking safer online information ecosystems.
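The combined objective described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the summary states only that cross-entropy is combined with a contrastive InfoNCE loss using λ = 0.1, so the additive form `ce + lam * contrastive`, the use of in-batch negatives, cosine similarity, the claim-to-document direction, and the temperature value of 0.07 are all assumptions, and every function name is hypothetical.

```python
import math
from typing import List, Sequence


def log_softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable log-softmax over a list of scores."""
    peak = max(scores)
    log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
    return [s - log_norm for s in scores]


def cross_entropy(logits: Sequence[float], label: int) -> float:
    """Negative log-probability of the gold class for one example."""
    return -log_softmax(logits)[label]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(claims, docs, temperature: float = 0.07) -> float:
    """InfoNCE with in-batch negatives (assumed setup).

    claims[i] and docs[i] form the positive pair; every other
    document in the batch acts as a negative for claims[i].
    """
    per_claim = []
    for i, claim in enumerate(claims):
        sims = [cosine(claim, doc) / temperature for doc in docs]
        per_claim.append(-log_softmax(sims)[i])
    return sum(per_claim) / len(per_claim)


def total_loss(logits, labels, claims, docs,
               lam: float = 0.1, temperature: float = 0.07) -> float:
    """Assumed combination: mean cross-entropy + lam * InfoNCE."""
    ce = sum(cross_entropy(l, y) for l, y in zip(logits, labels)) / len(labels)
    return ce + lam * info_nce(claims, docs, temperature)


# Toy batch: two claim/document embedding pairs and 3-class logits.
claims = [[1.0, 0.0], [0.0, 1.0]]
docs = [[0.9, 0.1], [0.2, 0.8]]
logits = [[2.0, 0.1, -1.0], [0.0, 1.5, 0.3]]
labels = [0, 1]

loss = total_loss(logits, labels, claims, docs)
```

With a small λ, the classification term dominates and the contrastive term acts as a regularizer that pulls each claim toward its paired document and away from the other documents in the batch.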

Abstract

Misinformation on the web increasingly appears in multimodal forms, combining text, images, and OCR-rendered content in ways that amplify harm to public trust and vulnerable communities. While prior fact-checking systems often rely on unimodal signals or shallow fusion strategies, modern misinformation campaigns operate across modalities and require models that can reason over subtle cross-modal inconsistencies in a transparent and responsible manner. We introduce MultiCheck, a lightweight and interpretable framework for multimodal fact verification that jointly analyzes textual, visual, and OCR evidence. At its core, MultiCheck employs a relational fusion module based on element-wise difference and product operations, allowing for explicit cross-modal interaction modeling with minimal computational overhead. A contrastive alignment objective further helps the model distinguish between supporting and refuting evidence while maintaining a small memory and energy footprint, making it suitable for low-resource deployment. Evaluated on the Factify-2 (5-class) and Mocheg (3-class) benchmarks, MultiCheck achieves substantial performance improvements and remains robust under noisy OCR and missing-modality conditions. Its efficiency, transparency, and real-world robustness make it well-suited for journalists, civil society organisations, and web integrity efforts working to build a safer and more trustworthy web.
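To make the relational fusion idea concrete, the following is a minimal sketch of fusing two embeddings with element-wise difference and product. It is an illustration under stated assumptions, not the paper's module: the abstract names only the two element-wise operations, so concatenating them with the original vectors, using a signed rather than absolute difference, and the function name `relational_fusion` are all hypothetical choices.

```python
from typing import List


def relational_fusion(claim: List[float], evidence: List[float]) -> List[float]:
    """Fuse two same-sized embeddings into one relational feature vector.

    Assumed layout: [claim ; evidence ; claim - evidence ; claim * evidence].
    The difference highlights where the two vectors disagree, and the
    product highlights dimensions where both are active together.
    """
    if len(claim) != len(evidence):
        raise ValueError("claim and evidence embeddings must have equal length")
    diff = [c - e for c, e in zip(claim, evidence)]
    prod = [c * e for c, e in zip(claim, evidence)]
    return list(claim) + list(evidence) + diff + prod


# Toy 3-dimensional embeddings standing in for encoder outputs.
claim_vec = [1.0, 2.0, -1.0]
evidence_vec = [1.0, 0.5, 2.0]

fused = relational_fusion(claim_vec, evidence_vec)
# fused has 4 * 3 = 12 entries and would feed a small classifier head.
```

Because the interaction features are computed without any learned parameters, the fusion step itself adds only a few element-wise operations per pair, which is consistent with the low-overhead design the abstract describes.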

Paper Structure

This paper contains 17 sections, 17 equations, 7 figures, 19 tables.

Figures (7)

  • Figure 1: Refuting a viral claim using combined text and image evidence.
  • Figure 2: A sample from the Factify-2 dataset.
  • Figure 3: Illustration: The claim suggests Nepal shot down an Indian helicopter. However, the OCR text contradicts this by suggesting Indian aggression, not Nepali. Without OCR, the model could misclassify this. The fused text representation enables correct "Refute" labeling.
  • Figure 4: Intuitive fusion representation using element-wise difference and product.
  • Figure 5: Example of qualitative analysis on a sample from the dataset.
  • ...and 2 more figures