Fake News Detection and Manipulation Reasoning via Large Vision-Language Models

Ruihan Jin, Ruibo Fu, Zhengqi Wen, Shuai Zhang, Yukun Liu, Jianhua Tao

TL;DR

This work tackles fake news detection in a multi-modal setting by introducing manipulation reasoning as a core capability. It presents the Human-centric and Fact-related Fake News (HFFN) benchmark to stress human faces and factual accuracy, paired with three manipulation types across four domains, and a rich annotation scheme. It then proposes M-DRUM, a large vision-language model that combines multi-modal feature extraction (via ImageBind and a facial encoder), cross-modal fusion, and a prompt-learning-based LVLM reasoning backbone to detect authenticity and analyze manipulations, trained in two stages. Empirical results show M-DRUM outperforms state-of-the-art multi-modal detectors and strong LVLMs like GPT-4 and LLaVA, with notable gains in few-shot settings and chain-of-thought reasoning.

Abstract

Fake news has become a growing threat to information security and public opinion with the rapid spread of media manipulation. As a result, fake news detection has attracted widespread attention from the academic community. Traditional fake news detection models perform well on binary authenticity classification, but their ability to reason about detailed manipulation traces based on the news content remains under-explored. Furthermore, because they lack external knowledge, the performance of existing methods on fact-related news is questionable, leaving their practical applicability unclear. In this paper, we propose a new multimedia research topic, namely manipulation reasoning, which aims to reason about manipulations based on news content. To support this research, we introduce a benchmark for fake news detection and manipulation reasoning, referred to as Human-centric and Fact-related Fake News (HFFN). The benchmark emphasizes human-centric content and high factual relevance, with detailed manual annotations. HFFN encompasses four realistic domains, with fake news samples generated through three manipulation approaches. Moreover, we present a Multi-modal news Detection and Reasoning langUage Model (M-DRUM) that not only judges the authenticity of multi-modal news but also provides analytical reasoning about potential manipulations. At the feature extraction level, a cross-attention mechanism is employed to extract fine-grained fusion features from multi-modal inputs. At the reasoning level, a large vision-language model (LVLM) serves as the backbone to facilitate fact-related reasoning. A two-stage training framework is deployed to better activate the model's capacity for identification and reasoning. Comprehensive experiments demonstrate that our model outperforms state-of-the-art (SOTA) fake news detection models and powerful LVLMs such as GPT-4 and LLaVA.
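
To make the cross-attention fusion step mentioned in the abstract more concrete, the following is a minimal sketch of generic scaled dot-product cross-attention, not the authors' implementation. It assumes headline tokens act as queries and image tokens as keys and values; that direction, the dimensions, the random projection matrices, and all function names are hypothetical choices made for illustration only.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def cross_attention(queries, keys_values, w_q, w_k, w_v):
    """Generic cross-attention: each query token attends over the other modality.

    Assumption (not from the paper): queries are headline tokens and
    keys_values are image tokens.
    """
    q = matmul(queries, w_q)
    k = matmul(keys_values, w_k)
    v = matmul(keys_values, w_v)
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, transpose(k))
    # One attention distribution over image tokens per headline token.
    weights = [softmax([s / scale for s in row]) for row in scores]
    fused = matmul(weights, v)
    return fused, weights


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
text_tokens = random_matrix(4, 8, rng)   # 4 hypothetical headline tokens, dim 8
image_tokens = random_matrix(6, 8, rng)  # 6 hypothetical image tokens, dim 8
w_q, w_k, w_v = (random_matrix(8, 5, rng) for _ in range(3))

fused, weights = cross_attention(text_tokens, image_tokens, w_q, w_k, w_v)
print(len(fused), len(fused[0]))      # one fused vector per headline token
print(len(weights), len(weights[0]))  # one weight per (headline, image) token pair
```

Each row of `weights` is a probability distribution over the image tokens, so every fused vector is a weighted mix of projected image features conditioned on one headline token. A real model would learn the projection matrices and typically use multiple attention heads.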

Paper Structure

This paper contains 22 sections, 8 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: An illustration of multi-modal fake news detection and manipulation reasoning. We construct the Human-centric and Fact-related Fake News (HFFN) benchmark through three approaches to media manipulation. We propose the Multi-modal news Detection and Reasoning langUage Model (M-DRUM) to not only perform authenticity classification but also reason about manipulations.
  • Figure 2: Statistics of HFFN benchmark. (I: Image Manipulation, T: Text Manipulation, F: Factual Manipulation, &: combination of two manipulation types)
  • Figure 3: The architecture of M-DRUM. News images and headlines are aligned with a multi-modal encoder, and a manipulation-specific facial feature is leveraged to enhance the human-centric representation. Fusion features are derived with the cross-attention mechanism. To bridge the gap between manipulation expertise and the general knowledge of the LVLM, a prompt learner is adopted, and an LVLM performs authenticity classification and manipulation reasoning. The model is trained under a two-stage framework to strengthen its capacity for identification and reasoning.
  • Figure 4: Performance fluctuation in few-shot learning.
  • Figure 5: Efficacy of chain-of-thought (CoT) reasoning.
  • ...and 1 more figure