MCFEND: A Multi-source Benchmark Dataset for Chinese Fake News Detection

Yupeng Li; Haorui He; Jin Bai; Dacheng Wen

TL;DR

The study tackles the limitation of single-source Chinese fake news datasets by introducing MCFEND, a large multi-source benchmark that spans 23,789 Chinese news items from 14 fact-checking agencies across three source groups and includes rich social-context signals. It formulates the problem as a binary classification over multimodal content and social context, and systematically evaluates six baseline models (both content-based and social-context-based) under cross-source, multi-source, and unseen-source settings. Key findings show substantial performance degradation when moving from Weibo-only training to other sources, and demonstrate that multi-source training substantially improves robustness, with notable gains for RoBERTa and CAFE, while modal fusion models generally offer better cross-source resilience. MCFEND is proposed as a practical benchmark to advance Chinese fake news detection in real-world, diverse-source environments, guiding the development of more robust, transferable detection methods.

Abstract

The prevalence of fake news across various online sources has had a significant influence on the public. Existing Chinese fake news detection datasets are limited to news sourced solely from Weibo. However, fake news originating from multiple sources is diverse in many respects, including its content and social context. Methods trained on a single news source are hardly applicable to real-world scenarios. Our pilot experiment demonstrates that the F1 score of the state-of-the-art method trained on a large Chinese fake news detection dataset, Weibo-21, drops significantly from 0.943 to 0.470 when the test data is changed to multi-source news data, failing to identify more than one-third of the multi-source fake news. To address this limitation, we constructed the first multi-source benchmark dataset for Chinese fake news detection, termed MCFEND, which is composed of news we collected from diverse sources such as social platforms, messaging apps, and traditional online news outlets. Notably, this news has been fact-checked by 14 authoritative fact-checking agencies worldwide. In addition, various existing Chinese fake news detection methods are thoroughly evaluated on our proposed dataset in cross-source, multi-source, and unseen-source settings. MCFEND, as a benchmark dataset, aims to advance Chinese fake news detection approaches in real-world scenarios.
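
The three evaluation settings named in the abstract can be illustrated with a minimal sketch. The split definitions below are assumptions about what "cross-source", "multi-source", and "unseen-source" commonly mean, not the paper's exact protocol; the field names (`source`, `label`), the function names, and the toy records are all hypothetical.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1 for the class given by `positive` (here: 1 = fake)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def make_split(items, setting, source=None, test_every=5):
    """Return (train, test) lists for one assumed evaluation setting.

    cross  : train on `source` only, test on every other source.
    unseen : train on every other source, test on the held-out `source`.
    multi  : train and test both mix all sources (every `test_every`-th
             item goes to the test set; a deterministic stand-in for a
             random split).
    """
    if setting == "cross":
        train = [x for x in items if x["source"] == source]
        test = [x for x in items if x["source"] != source]
    elif setting == "unseen":
        train = [x for x in items if x["source"] != source]
        test = [x for x in items if x["source"] == source]
    elif setting == "multi":
        train = [x for i, x in enumerate(items) if (i + 1) % test_every != 0]
        test = [x for i, x in enumerate(items) if (i + 1) % test_every == 0]
    else:
        raise ValueError(f"unknown setting: {setting}")
    return train, test


# Hypothetical toy records; real items would also carry text, images,
# and social context.
toy_items = [
    {"source": "weibo", "label": 1},
    {"source": "weibo", "label": 0},
    {"source": "wechat", "label": 1},
    {"source": "news_outlet", "label": 0},
]

cross_train, cross_test = make_split(toy_items, "cross", source="weibo")
unseen_train, unseen_test = make_split(toy_items, "unseen", source="news_outlet")
multi_train, multi_test = make_split(toy_items, "multi", test_every=2)
```

Under these assumptions, the pilot experiment in the abstract corresponds to the "cross" setting with Weibo as the training source: a detector fitted on `cross_train` is scored with `f1_score` on `cross_test`.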

Paper Structure (23 sections, 3 figures, 10 tables)

Introduction
Preliminaries and Related work
MCFEND Dataset
Overview
Dataset Construction
Group 1: Fact-checking Agencies Data Crawling
Group 2: Cross-lingual Identical News Retrieval
Group 3: Weibo News Collection
Social Context Collection
Post-collection Processing
Comparison of the Three Groups
Experiments
Baselines
Content-based Methods
Social Context-based Methods
...and 8 more sections

Figures (3)

Figure 1: An example of four pieces of fake news from four different Chinese news sources, including Weibo (a popular social platform), China Times (an online news outlet), WeChat (a messaging app), and Douyin (a social platform). Each piece of fake news shows different characteristics in aspects such as content, topics, publishing methods, and linguistic styles.
Figure 2: The process for constructing the MCFEND dataset.
Figure 3: Visualization of textual and social emotion features for news collected from three distinct groups of fact-checking agencies.
