CoVLM: Leveraging Consensus from Vision-Language Models for Semi-supervised Multi-modal Fake News Detection

Devank; Jayateja Kalla; Soma Biswas

TL;DR

The paper proposes Consensus from Vision-Language Models (CoVLM), a framework that generates robust pseudo-labels for unlabeled image-text pairs. The thresholds used to select confident pseudo-labels are derived automatically from the labeled data.

Abstract

In this work, we address the challenging real-world task of out-of-context misinformation detection, where a real image is paired with an incorrect caption to create fake news. Existing approaches for this task assume the availability of large amounts of labeled data, which is often impractical in real-world settings, since labeling requires extensive manual intervention and domain expertise. In contrast, obtaining a large corpus of unlabeled image-text pairs is much easier, so we propose a semi-supervised protocol in which the model has access to a limited number of labeled image-text pairs and a large corpus of unlabeled pairs. Additionally, because fake news occurs far less often than real news, the datasets tend to be highly imbalanced, which makes the task even more challenging. To address these challenges, we propose a novel framework, Consensus from Vision-Language Models (CoVLM), which generates robust pseudo-labels for unlabeled pairs using thresholds derived from the labeled data. This approach automatically determines the threshold parameters for selecting confident pseudo-labels. Experimental results on benchmark datasets across challenging conditions, along with comparisons with state-of-the-art approaches, demonstrate the effectiveness of our framework.

Paper Structure (17 sections, 5 equations, 8 figures, 4 tables, 2 algorithms)

Introduction
Related Work
Problem Definition
CoVLM for Semi-Supervised MFND
Unlabeled data: Pseudo-Labels using Caption Consensus
Threshold Parameters from Labeled Data
Unified Training using both Labeled and Unlabeled Data
Experiments
Dataset details
Implementation Details
Baselines
Experimental Results
Analysis and Ablation Study
Impact of Data Imbalance in MFND
Impact of Amount of Unlabeled Data
...and 2 more sections

Figures (8)

Figure 1: A sample real and fake image-text pair from the NewsCLIPpings dataset (Luo et al., 2021). The model needs to capture the subtle inconsistencies between the image and the text to judge the pair's authenticity.
Figure 2: Overview of CoVLM. For a given image-text pair, BLIP generates an additional image description. Using the original image, text, and the generated text, a decision is made on whether the pair is real or fake. This label is then used in training. The decision module’s threshold parameters are estimated from the labeled data.
Figure 3: Illustration of the pseudo-label assignment for the unlabeled image-text pairs. The CLIP consensus score $\mathcal{S}_{c}$ is calculated from the image and text embeddings, and the BLIP consensus score $\mathcal{S}_{b}$ is calculated from the given text and the generated text.
Figure 3: Experiment results on imbalanced NewsCLIPpings dataset.
Figure 4: Illustration of the estimation of threshold parameters using labeled data. The BLIP consensus score $\mathcal{S}_{b}$ and the CLIP consensus score $\mathcal{S}_{c}$ are calculated for all labeled samples. The means of these labeled BLIP and CLIP consensus scores act as the threshold parameters for the unlabeled data.
...and 3 more figures
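
The threshold estimation and pseudo-label selection described in the captions of Figures 3 and 4 can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `estimate_thresholds` and `pseudo_label` are hypothetical, the consensus scores are taken as precomputed numbers, and the decision rule (both scores at or above their thresholds means "real", both below means "fake", otherwise the pair is left unlabeled) is an assumption, since the summary does not state how the two scores are combined.

```python
from statistics import mean


def estimate_thresholds(labeled_scores):
    """Return (tau_c, tau_b): the means of the CLIP and BLIP consensus scores.

    labeled_scores is a list of (s_c, s_b) tuples, one per labeled pair.
    Using the mean as the threshold follows the Figure 4 caption.
    """
    tau_c = mean(s_c for s_c, _ in labeled_scores)
    tau_b = mean(s_b for _, s_b in labeled_scores)
    return tau_c, tau_b


def pseudo_label(s_c, s_b, tau_c, tau_b):
    """Assign a pseudo-label to one unlabeled pair, or None if not confident.

    ASSUMPTION: higher consensus means the caption agrees with the image,
    and a label is assigned only when both scores agree on the same side
    of their thresholds. The actual CoVLM rule may differ.
    """
    if s_c >= tau_c and s_b >= tau_b:
        return "real"
    if s_c < tau_c and s_b < tau_b:
        return "fake"
    return None  # scores disagree: leave the pair unlabeled


# Toy scores for two labeled pairs (made-up values, for illustration only).
labeled = [(0.25, 0.5), (0.75, 1.0)]
tau_c, tau_b = estimate_thresholds(labeled)

for s_c, s_b in [(0.9, 0.9), (0.1, 0.2), (0.9, 0.2)]:
    print(s_c, s_b, pseudo_label(s_c, s_b, tau_c, tau_b))
```

Abstaining when the two scores disagree is one way to keep only confident pseudo-labels, which matches the intent stated in the abstract. The specific abstention rule here is illustrative.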
