Interpretable Detection of Out-of-Context Misinformation with Neural-Symbolic-Enhanced Large Multimodal Model

Yizhou Zhang, Loc Trinh, Defu Cao, Zijun Cui, Yan Liu

TL;DR

The paper tackles out-of-context multimedia misinformation by introducing an interpretable cross-modal detector that symbolically decomposes the caption text, via its AMR graph, into elementary queries. These queries are evaluated against the accompanying image using a large vision-language model, and a query ranker selects the most informative queries to form evidence-backed predictions, with the supportive probability defined as $P_S(q)=P(f(h_f)=0)+P(f(h_f)=1)$. The authors demonstrate that this evidence-based, neural-symbolic framework achieves competitive accuracy on NewsCLIPpings while providing interpretable outputs, addressing the need for explainable fact-checking. Limitations include a focus on factual inconsistencies and coarse-grained evidence; future work aims at region-level localization and at generating clarifications for verifiers.

Abstract

Recent years have witnessed the sustained evolution of misinformation that aims to manipulate public opinion. Unlike creators of traditional rumors or fake news, who mainly rely on generated and/or counterfeited images, text, and videos, current misinformation creators increasingly use out-of-context multimedia content (e.g., mismatched images and captions) to deceive the public and fake news detection systems. This new type of misinformation makes not only detection but also clarification more difficult, because each individual modality is close enough to true information. To address this challenge, in this paper we explore how to achieve interpretable cross-modal de-contextualization detection that simultaneously identifies the mismatched pairs and the cross-modal contradictions, which helps fact-checking websites document clarifications. The proposed model first symbolically disassembles the text-modality information into a set of fact queries based on the Abstract Meaning Representation (AMR) of the caption, and then forwards the query-image pairs into a pre-trained large vision-language model to select the "evidence" that is helpful for detecting misinformation. Extensive experiments indicate that the proposed methodology provides much more interpretable predictions while maintaining accuracy on par with the state-of-the-art model on this task.

Paper Structure

This paper contains 12 sections, 4 equations, 3 figures, 2 tables, 2 algorithms.

Figures (3)

  • Figure 1: Examples of mismatched text-image pairs. The left pair is mismatched because the image was clearly taken in winter (judging by the clothes and background) rather than on Independence Day. The right pair is mismatched because the cars have yellow license plates, whereas taxis in China use blue license plates.
  • Figure 2: Overview of our proposed method. It first parses the text into AMR graphs with off-the-shelf tools. It then extracts queries with a symbolic elementary-fact extraction algorithm that we designed. Next, a large pre-trained multimodal model determines whether each query is supported by the visual input. Finally, a query ranker selects the important and reliable queries as the evidence for the final judgment.
  • Figure 3: Pipeline of the query ranker. The caption and query embeddings are each boosted by a vision embedding vector and then fused to obtain a final representation $h_f$. After that, $h_f$ is forwarded into a predictor to obtain the final evidence score.
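
The evidence score in Figure 3 can be related to the supportive probability $P_S(q)=P(f(h_f)=0)+P(f(h_f)=1)$ with a small sketch. It assumes, purely for illustration, that the predictor $f$ emits three logits per query, that classes 0 and 1 are the two classes in which the query counts as evidence, and that class 2 means the query is not useful. The summary above gives only the formula, so this class layout, the function names, and the toy logits are all hypothetical, and the fusion step that produces $h_f$ is not modeled.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a short list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def supportive_probability(logits: Sequence[float]) -> float:
    """P_S(q) = P(f(h_f)=0) + P(f(h_f)=1).

    Assumption: `logits` has three entries, and class 2 stands for
    "not useful as evidence". This layout is hypothetical.
    """
    probs = softmax(logits)
    return probs[0] + probs[1]


def rank_queries(
    queries: Sequence[str],
    logits_per_query: Sequence[Sequence[float]],
    top_k: int = 2,
) -> List[Tuple[str, float]]:
    """Return the top_k queries sorted by supportive probability, highest first."""
    scored = [
        (query, supportive_probability(logits))
        for query, logits in zip(queries, logits_per_query)
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Toy queries and logits, invented for illustration only.
    example_queries = [
        "Is it Independence Day?",
        "Are people wearing winter clothes?",
        "Is there a crowd?",
    ]
    example_logits = [[0.0, 0.0, 5.0], [3.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
    for query, score in rank_queries(example_queries, example_logits):
        print(f"{score:.3f}  {query}")
```

Summing the two evidence classes means a query ranks highly whenever the predictor is confident it is informative, regardless of which verdict it points to; that matches the ranker's stated role of selecting important and reliable queries rather than deciding the final label.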