Interpretable Multimodal Out-of-context Detection with Soft Logic Regularization

Huanhuan Ma, Jinghao Zhang, Qiang Liu, Shu Wu, Liang Wang

TL;DR

This work tackles the interpretable detection of out-of-context image-caption misinformation. It introduces LOGRAN, a latent-variable framework that decomposes captions into phrases, predicts phrase-level veracity, and aggregates these via soft-logic rules to produce a final verdict, with a teacher-student distillation mechanism to inject logical consistency. The method leverages a variational EM approach to learn phrase-level predictions without phrase-level labels, and demonstrates improvements over strong baselines on the NewsCLIPpings dataset while offering interpretable explanations through identified culprits. The approach has practical impact by providing both accurate detection and interpretable, localized explanations that reveal which phrases drive the final decision, applicable across state-of-the-art visual-language backbones.

Abstract

The rapid spread of information through mobile devices and media has led to the widespread dissemination of false or deceptive news, causing significant concern in society. Among the different types of misinformation, image repurposing, also known as out-of-context misinformation, remains highly prevalent and effective. However, current approaches for detecting out-of-context misinformation often lack interpretability and offer limited explanations. In this study, we propose a logic regularization approach for out-of-context detection called LOGRAN (LOGic Regularization for out-of-context ANalysis). The primary objective of LOGRAN is to decompose out-of-context detection to the phrase level. By employing latent variables for phrase-level predictions, the final prediction for the image-caption pair can be aggregated using logical rules. The latent variables also explain how the final result is derived, making this fine-grained detection method inherently explanatory. We evaluate LOGRAN on the NewsCLIPpings dataset, where it achieves competitive overall results. Visualized examples also reveal faithful phrase-level predictions for out-of-context images, accompanied by explanations. This highlights the effectiveness of our approach in addressing out-of-context detection and enhancing interpretability.
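The aggregation step described in the abstract can be illustrated with a minimal sketch. It assumes one common soft-logic choice, the product t-norm, under which a caption counts as pristine only if every phrase is pristine; the abstract does not say which relaxation LOGRAN actually uses. The function names `aggregate_pristine` and `find_culprit` and the example probabilities are hypothetical and chosen purely for illustration.

```python
from math import prod


def aggregate_pristine(phrase_pristine_probs):
    """Soft conjunction over phrases (product t-norm, an assumed choice).

    Each entry is the probability that one phrase is consistent with the
    image. The caption-level "pristine" score is the product of all entries.
    """
    if not phrase_pristine_probs:
        raise ValueError("at least one phrase probability is required")
    return prod(phrase_pristine_probs)


def find_culprit(phrase_pristine_probs):
    """Return the index of the phrase least likely to be pristine."""
    return min(
        range(len(phrase_pristine_probs)),
        key=lambda i: phrase_pristine_probs[i],
    )


# Hypothetical phrase-level scores for a three-phrase caption.
probs = [0.2, 0.9, 0.95]
p_pristine = aggregate_pristine(probs)
p_falsified = 1.0 - p_pristine
culprit = find_culprit(probs)
```

With these illustrative scores the first phrase dominates the verdict: its low pristine probability pulls the caption-level score down and it is returned as the culprit, mirroring the kind of phrase-level explanation the abstract describes.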

Paper Structure (16 sections, 6 equations, 2 figures, 1 table)

Figures (2)

  • Figure 1: An example of the LOGRAN pipeline. LOGRAN not only predicts the overall veracity of the image-caption pair, but also uses latent variables to indicate veracity at the phrase level. In this example, the "Culprit" of the caption is easily identified as $w_1$, which manipulates the subject; the pristine caption is "Tunisian and Angolan players fight for the ball on Sunday during a handball tournament in Spain Angola go on to win".
  • Figure 2: Examples showcasing the outputs of LOGRAN. Each example includes a pair of images with matching captions. The left image is labeled "Pristine", while the right image is labeled "Falsified". Phrases identified as the "Culprit" are highlighted in red.