TT-BLIP: Enhancing Fake News Detection Using BLIP and Tri-Transformer

Eunjee Choi, Jong-Kook Kim

TL;DR

This paper tackles multimodal fake news detection by jointly leveraging text and image information through a three-pathway architecture that combines BERT/BLIPTxt for text, ResNet/BLIPImg for images, and BLIP-based image-text features. It introduces the Multimodal Tri-Transformer, which fuses text, image, and image-text representations using cross-modal and self-attention, prioritizing textual cues while maintaining cross-modal context. Evaluations on Weibo and Gossipcop show state-of-the-art performance: TT-BLIP achieves accuracies of 96.1% and 88.5%, respectively, outperforming traditional fusion and unimodal baselines. The study supports the value of specialized feature extraction and integrated fusion for reliable detection, offering a practical approach for combating misinformation across social media platforms.

Abstract

Detecting fake news has received a lot of attention. Many previous methods concatenate independently encoded unimodal data, ignoring the benefits of integrated multimodal information. Also, the absence of specialized feature extraction for text and images further limits these methods. This paper introduces an end-to-end model called TT-BLIP that applies the bootstrapping language-image pretraining for unified vision-language understanding and generation (BLIP) for three types of information: BERT and BLIPTxt for text, ResNet and BLIPImg for images, and bidirectional BLIP encoders for multimodal information. The Multimodal Tri-Transformer fuses tri-modal features using three types of multi-head attention mechanisms, ensuring integrated modalities for enhanced representations and improved multimodal data analysis. The experiments are performed using two fake news datasets, Weibo and Gossipcop. The results indicate TT-BLIP outperforms the state-of-the-art models.
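
The contrast the abstract draws between plain concatenation and attention-based fusion can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes single-head scaled dot-product attention with no learned projections, random arrays standing in for the BERT/BLIP/ResNet features, and a hypothetical `fuse` helper in which text tokens act as the queries in every branch. The token counts and feature width are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(query, key, value):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = query.shape[-1]
    scores = query @ key.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ value

def fuse(text, image, image_text):
    """Hypothetical fusion step (an assumption, not the paper's exact design).

    Text tokens are the queries in all three branches, which is one way
    to let textual cues guide what is read from the other modalities.
    """
    text_self = attention(text, text, text)                    # self-attention
    text_to_image = attention(text, image, image)              # cross-modal
    text_to_pair = attention(text, image_text, image_text)     # cross-modal
    # Mean-pool each branch over tokens, then concatenate the pooled vectors.
    pooled = [b.mean(axis=0) for b in (text_self, text_to_image, text_to_pair)]
    return np.concatenate(pooled)

rng = np.random.default_rng(0)
d = 16                                   # assumed shared feature width
text = rng.normal(size=(12, d))          # stand-in for text features
image = rng.normal(size=(49, d))         # stand-in for image features
image_text = rng.normal(size=(20, d))    # stand-in for image-text features

fused = fuse(text, image, image_text)
print(fused.shape)  # (48,)
```

In a real model each branch would use learned query, key, and value projections with multiple heads, and the fused vector would feed a binary real/fake classifier.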

Paper Structure

This paper contains 19 sections, 7 equations, 4 figures, and 3 tables.

Figures (4)

  • Figure 1: Real news (a) and fake news (b) examples from the Weibo dataset.
  • Figure 2: The architecture of the proposed TT-BLIP.
  • Figure 3: Architecture of different fusion strategies for multimodal fake news detection.
  • Figure 4: t-SNE visualization of extracted features from the Weibo test set using TT-BLIP. Each color represents a distinct label grouping.