VGA: Vision and Graph Fused Attention Network for Rumor Detection
Lin Bai, Caiyan Jia, Ziying Song, Chaoqun Cui
TL;DR
This work introduces VGA, a Vision and Graph Fused Attention Network for multimodal rumor detection, integrating propagation structures, visual semantics, image tampering cues, and OCR-derived text within a jointly trained, attention-based framework. It proposes using SRM-filtered noise features to capture tampering, mutual enhanced co-attention to fuse graph and visual modalities, and a similarity loss to align cross-modal representations, achieving state-of-the-art performance on three real-world datasets, including the newly released DRWeiboMM dataset. A comprehensive ablation study confirms the critical roles of the similarity measurement, data augmentation, OCR augmentation, and tampering features, with case studies illustrating practical benefits. The approach advances rumor detection by effectively exploiting crowd wisdom encoded in propagation graphs and robust visual cues, enabling more reliable detection in multimodal and manipulated content scenarios.
Abstract
With the development of social media, rumors spread widely across platforms, causing great harm to society. Besides textual information, many rumors also use manipulated images or conceal textual information within images to deceive people and evade detection, making multimodal rumor detection a critical problem. Most multimodal rumor detection methods concentrate on extracting features from source claims and their corresponding images, while ignoring the comments on rumors and their propagation structures. These comments and structures embody the wisdom of crowds and have proved crucial for debunking rumors. Moreover, these methods usually extract visual features only in a basic manner and seldom consider tampering or textual information in images. Therefore, in this study, we propose a novel Vision and Graph Fused Attention Network (VGA) for rumor detection, which utilizes propagation structures among posts to capture crowd opinions and further explores visual tampering features as well as the textual information hidden in images. We conduct extensive experiments on three datasets, demonstrating that VGA can effectively detect multimodal rumors and significantly outperform state-of-the-art methods.
