CoVLM: Leveraging Consensus from Vision-Language Models for Semi-supervised Multi-modal Fake News Detection

Devank, Jayateja Kalla, Soma Biswas

TL;DR

Consensus from Vision-Language Models (CoVLM) is a novel semi-supervised framework that generates robust pseudo-labels for unlabeled image-text pairs. The thresholds used to select confident pseudo-labels are derived automatically from the labeled data.

Abstract

In this work, we address the challenging real-world task of out-of-context misinformation detection, where a real image is paired with an incorrect caption to create fake news. Existing approaches for this task assume the availability of large amounts of labeled data, which is often impractical in real-world settings because labeling requires extensive manual intervention and domain expertise. In contrast, obtaining a large corpus of unlabeled image-text pairs is much easier, so we propose a semi-supervised protocol in which the model has access to a limited number of labeled image-text pairs and a large corpus of unlabeled pairs. Additionally, because fake news occurs far less often than real news, the datasets tend to be highly imbalanced, which makes the task even more challenging. To this end, we propose a novel framework, Consensus from Vision-Language Models (CoVLM), which generates robust pseudo-labels for unlabeled pairs using thresholds derived from the labeled data. This approach automatically determines the threshold parameters used to select confident pseudo-labels. Experimental results on benchmark datasets across challenging conditions, along with comparisons with state-of-the-art approaches, demonstrate the effectiveness of our framework.

Paper Structure (17 sections, 5 equations, 8 figures, 4 tables, 2 algorithms)

Figures (8)

  • Figure 1: A sample real and fake image-text pair from the NewsCLIPpings dataset (Luo et al., 2021). The model needs to capture the subtle inconsistencies between the image and the text to judge whether a pair is authentic.
  • Figure 2: Overview of CoVLM. For a given image-text pair, BLIP generates an additional image description. Using the original image, text, and the generated text, a decision is made on whether the pair is real or fake. This label is then used in training. The decision module’s threshold parameters are estimated from the labeled data.
  • Figure 3: Illustration of the pseudo-label assignment for the unlabeled image-text pairs. The CLIP consensus score $\mathcal{S}_{c}$ is calculated from the image and text embeddings, and the BLIP consensus score $\mathcal{S}_{b}$ is calculated from the given text and the generated text.
  • Figure 3: Experimental results on the imbalanced NewsCLIPpings dataset.
  • Figure 4: Illustration of the estimation of threshold parameters using labeled data. The BLIP consensus score $\mathcal{S}_{b}$ and the CLIP consensus score $\mathcal{S}_{c}$ are calculated for all labeled samples. The means of these labeled BLIP and CLIP consensus scores act as the threshold parameters for the unlabeled data.
  • ...and 3 more figures
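
The threshold estimation and pseudo-label assignment described in the captions above can be illustrated with a minimal sketch. Only the use of the labeled-data means as thresholds comes from the captions. Everything else is an assumption made for illustration: the function names are hypothetical, the consensus scores are taken to be cosine similarities between precomputed embeddings, and the rule "keep a pseudo-label only when both scores fall on the same side of their thresholds" is a stand-in for the paper's actual decision module, which this page does not specify.

```python
import math
from statistics import mean


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def estimate_thresholds(labeled_scores):
    """Return (tau_c, tau_b): the mean CLIP and BLIP consensus scores
    over the labeled pairs, as stated in the Figure 4 caption.

    labeled_scores is a list of (s_c, s_b) tuples, one per labeled pair.
    """
    tau_c = mean(s_c for s_c, _ in labeled_scores)
    tau_b = mean(s_b for _, s_b in labeled_scores)
    return tau_c, tau_b


def pseudo_label(s_c, s_b, tau_c, tau_b):
    """ASSUMED decision rule (not taken from the paper): accept a
    pseudo-label only when both scores agree relative to their thresholds.
    """
    if s_c >= tau_c and s_b >= tau_b:
        return "real"   # image, caption, and generated text all agree
    if s_c < tau_c and s_b < tau_b:
        return "fake"   # both consensus scores are low
    return None         # the scores disagree: leave the pair unlabeled


# Toy usage with made-up scores and embeddings (no real CLIP/BLIP calls).
tau_c, tau_b = estimate_thresholds([(0.2, 0.4), (0.6, 0.8)])
s_c = cosine([1.0, 0.0], [1.0, 0.0])   # stand-in for image vs. caption
s_b = cosine([0.0, 1.0], [0.0, 1.0])   # stand-in for caption vs. generated text
label = pseudo_label(s_c, s_b, tau_c, tau_b)
```

In practice the embeddings would come from CLIP and BLIP encoders rather than hand-written vectors. Returning `None` for disagreeing scores reflects the idea of keeping only confident pseudo-labels, again under the assumed rule.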