FKA-Owl: Advancing Multimodal Fake News Detection through Knowledge-Augmented LVLMs

Xuannan Liu, Peipei Li, Huaibo Huang, Zekun Li, Xing Cui, Jiahao Liang, Lixiong Qin, Weihong Deng, Zhaofeng He

TL;DR

This work tackles open-world multimodal fake news detection (MFND), where detectors must cope with domain shift. It introduces FKA-Owl, a large vision-language model (LVLM) augmented with forgery-specific knowledge: two lightweight modules encode the semantic correlation between text and image and the artifact traces left by image manipulation, and their embeddings are projected into the LVLM’s language space. The method uses MFND instruction-following data, candidate-answer heuristics, and soft prompts to activate the LVLM’s knowledge, and achieves better cross-domain performance on DGM^4 and NewsCLIPpings than strong baselines. The results indicate that combining the world knowledge of LVLMs with forgery-specific cues improves generalization when detecting manipulated image–text pairs in real-world settings.

Abstract

The massive generation of multimodal fake news involving both text and images exhibits substantial distribution discrepancies, prompting the need for generalized detectors. However, the insulated nature of training restricts the capability of classical detectors to obtain open-world facts. While Large Vision-Language Models (LVLMs) have encoded rich world knowledge, they are not inherently tailored for combating fake news and struggle to comprehend local forgery details. In this paper, we propose FKA-Owl, a novel framework that leverages forgery-specific knowledge to augment LVLMs, enabling them to reason about manipulations effectively. The augmented forgery-specific knowledge includes semantic correlation between text and images, and artifact trace in image manipulation. To inject these two kinds of knowledge into the LVLM, we design two specialized modules to establish their representations, respectively. The encoded knowledge embeddings are then incorporated into LVLMs. Extensive experiments on the public benchmark demonstrate that FKA-Owl achieves superior cross-domain performance compared to previous methods. Code is publicly available at https://liuxuannan.github.io/FKA_Owl.github.io/.

Paper Structure (36 sections, 12 equations, 8 figures, 7 tables)

This paper contains 36 sections, 12 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Illustration of the effect of forgery-knowledge augmentation. (a) An example of a manipulated image-text pair in which Trump's face is swapped with another person's and the positive phrase "accept an award" is replaced with the negative "lost an argument". (b) Existing LVLMs struggle to correctly judge the news veracity. (c) Incorporating forgery-specific knowledge (i.e., semantic correlation and artifact trace) into the LVLM helps the model make accurate predictions.
  • Figure 2: Architecture of our proposed FKA-Owl, which is built upon an off-the-shelf LVLM consisting of an image encoder and an LLM. Given a manipulated image-text pair, the cross-modal reasoning module (a) first extracts cross-modal semantic embeddings and visual patch features. Then, these visual patch features are processed by the visual-artifact localization module (b) to encode precise artifact embeddings. Finally, the semantic and artifact embeddings are incorporated into the forgery-aware vision-language model (c), combined with image features and the human prompt, for deep manipulation reasoning.
  • Figure 3: Ablation study of the world knowledge inherent in large vision-language models.
  • Figure 4: Ablation study of an alternative module choice: using a pre-trained artifact detector to replace the visual-artifact localization module.
  • Figure A5: Lists of two state-level prompts considered in this paper to construct contrastive class prompts.
  • ...and 3 more figures
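
The data flow described in the Figure 2 caption can be sketched roughly as follows. This is a toy illustration only, not the authors' implementation: the linear projection, the token ordering, and every name here (`project`, `build_llm_input`, the weight and bias arguments) are assumptions made for the example. It shows only how semantic and artifact embeddings might be mapped into the LLM's embedding space and joined with image features and prompt embeddings.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def project(tokens: Matrix, weight: Matrix, bias: Vector) -> Matrix:
    """Map each token (length d_in) to length d_out via a linear layer.

    Assumption: a plain linear projection; the paper may use something else.
    """
    return [
        [sum(tok[i] * weight[i][j] for i in range(len(tok))) + bias[j]
         for j in range(len(bias))]
        for tok in tokens
    ]


def build_llm_input(image_feats: Matrix, semantic_emb: Matrix,
                    artifact_emb: Matrix, prompt_emb: Matrix,
                    w_sem: Matrix, b_sem: Vector,
                    w_art: Matrix, b_art: Vector) -> Matrix:
    """Concatenate all token embeddings into one input sequence for the LLM.

    Assumption: the ordering (image, semantic, artifact, prompt) is
    hypothetical and chosen only for illustration.
    """
    sem_tokens = project(semantic_emb, w_sem, b_sem)
    art_tokens = project(artifact_emb, w_art, b_art)
    return image_feats + sem_tokens + art_tokens + prompt_emb


# Toy dimensions: LLM width 3, semantic width 2, artifact width 1.
image_feats = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
semantic_emb = [[1.0, 2.0]]
artifact_emb = [[2.0]]
prompt_emb = [[9.0, 9.0, 9.0]]

w_sem = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
b_sem = [0.0, 0.0, 0.0]
w_art = [[0.5, 0.5, 0.5]]
b_art = [1.0, 1.0, 1.0]

seq = build_llm_input(image_feats, semantic_emb, artifact_emb, prompt_emb,
                      w_sem, b_sem, w_art, b_art)
print(len(seq), len(seq[0]))  # 5 3
```

In this sketch every token ends up with the same width as the LLM embedding, which is what lets the knowledge embeddings sit in the same sequence as the image features and the prompt.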