Cross-Modal Augmentation for Few-Shot Multimodal Fake News Detection

Ye Jiang, Taihang Wang, Xiaoman Xu, Yimin Wang, Xingyi Song, Diana Maynard

TL;DR

This paper tackles the challenge of few-shot multimodal fake news detection by introducing Cross-Modal Augmentation (CMA), which augments multimodal features with unimodal cues to transform standard $n$-shot learning into a robust $(n \times z)$-shot regime using a fixed pretrained encoder and linear probing. CMA leverages CLIP-based text and image representations and cross-attention to generate five modality-specific inferences, which are then fused by a meta-linear classifier. Empirical results across PolitiFact, GossipCop, and Weibo show CMA achieves state-of-the-art accuracy with substantially lower training overhead than fine-tuned large models, highlighting both effectiveness and efficiency in few-shot settings. The work also provides extensive ablations, stability analyses, and domain-shift evaluations, offering insight into when and why unimodal augmentation helps multimodal fake news detection. Limitations include reliance on CLIP and cosine-based image selection, with future directions toward broader multimodal encoders and domain adaptation strategies.

Abstract

The nascent topic of fake news requires automatic detection methods to quickly learn from limited annotated samples. Therefore, the capacity to rapidly acquire proficiency in a new task with limited guidance, also known as few-shot learning, is critical for detecting fake news in its early stages. Existing approaches either involve fine-tuning pre-trained language models which come with a large number of parameters, or training a complex neural network from scratch with large-scale annotated datasets. This paper presents a multimodal fake news detection model which augments multimodal features using unimodal features. For this purpose, we introduce Cross-Modal Augmentation (CMA), a simple approach for enhancing few-shot multimodal fake news detection by transforming $n$-shot classification into a more robust $(n \times z)$-shot problem, where $z$ represents the number of supplementary features. The proposed CMA achieves SOTA results over three benchmark datasets, utilizing a surprisingly simple linear probing method to classify multimodal fake news with only a few training samples. Furthermore, our method is significantly more lightweight than prior approaches, particularly in terms of the number of trainable parameters and epoch times. The code is available here: https://github.com/zgjiangtoby/FND_fewshot
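
As a rough illustration of the $n$-shot to $(n \times z)$-shot idea described in the abstract, the sketch below treats each of $z$ feature views of a sample as an extra training example and fits a linear probe on the resulting $n \times z$ rows. It is a minimal sketch under stated assumptions, not the authors' implementation: random vectors stand in for frozen CLIP features, the helper names `augment_views` and `train_linear_probe` are hypothetical, and the cross-attention and meta-linear fusion steps of CMA are not reproduced.

```python
import numpy as np

def augment_views(views, labels):
    """Stack z feature views of n samples into an (n*z)-row training set.

    views has shape (z, n, d); labels has shape (n,).
    Hypothetical helper, not taken from the paper's code.
    """
    z, n, d = views.shape
    features = views.reshape(z * n, d)   # rows ordered view by view
    targets = np.tile(labels, z)         # labels repeated once per view
    return features, targets

def train_linear_probe(features, targets, lr=0.5, steps=200):
    """Fit a logistic-regression probe on frozen features by gradient descent."""
    w = np.zeros(features.shape[1])
    b = 0.0
    for _ in range(steps):
        logits = features @ w + b
        probs = 1.0 / (1.0 + np.exp(-logits))
        grad = probs - targets           # gradient of the log loss w.r.t. logits
        w -= lr * features.T @ grad / len(targets)
        b -= lr * grad.mean()
    return w, b

# Assumed toy setup: n = 8 labelled samples, z = 3 views, d = 16 dimensions.
rng = np.random.default_rng(0)
n, z, d = 8, 3, 16
labels = np.array([0, 1] * (n // 2))
views = rng.normal(size=(z, n, d))       # placeholder for frozen encoder outputs

features, targets = augment_views(views, labels)
w, b = train_linear_probe(features, targets)
print(features.shape)                    # (24, 16): n * z training rows
```

Only the probe weights `w` and `b` are trained here, which mirrors the lightweight, fixed-encoder setting the abstract emphasizes; the encoder itself is never updated.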

Paper Structure (21 sections, 5 equations, 4 figures, 5 tables, 1 algorithm)

Figures (4)

  • Figure 1: Information from different modalities assists humans in decision-making, especially when faced with uncertainty.
  • Figure 2: The overall architecture of the CMA model.
  • Figure 3: The standard deviations of accuracies for both PolitiFact and GossipCop datasets among the few-shot baselines and the proposed CMA.
  • Figure 4: Feature visualization comparisons between M-SAMPLE and CMA. English translation of the Weibo example: "When you buy toothpaste, pay attention to the color bar on the bottom of the toothpaste tube, the color bar has meaning! Try to choose greens and blues. Green: natural, blue: natural + medicine, Red: natural + chemical composition, Black: pure chemical. Surprisingly, most children's toothpaste brands on the domestic market contain chemical ingredients."