Fake News Detection: It's All in the Data!

Soveatin Kuntur, Anna Wróblewska, Marcin Paprzycki, Maria Ganzha

TL;DR

The paper addresses the central challenge of fake news detection by examining how dataset quality, diversity, and labeling influence model performance. It provides a comprehensive taxonomy of data types (textual, visual, multimodal, and generative machine text), analyzes common features, and discusses biases and ethical considerations that shape detector robustness. Key contributions include a synthesis of dataset characteristics, a critique of annotation practices, best-practice guidelines for dataset construction, and a public GitHub portal aggregating datasets to foster reproducibility and collaboration. The work highlights the rising importance of multimodal data, continuous dataset updates, and synthetic data as part of a forward-looking agenda for robust, generalizable fake news detection systems.

Abstract

This comprehensive survey serves as an indispensable resource for researchers embarking on the journey of fake news detection. By highlighting the pivotal role of dataset quality and diversity, it underscores the significance of these elements in the effectiveness and robustness of detection models. The survey meticulously outlines the key features of datasets, the various labeling systems employed, and the prevalent biases that can impact model performance. Additionally, it addresses critical ethical issues and best practices, offering a thorough overview of the current state of available datasets. Our contribution to this field is further enriched by the provision of a GitHub repository, which consolidates publicly accessible datasets into a single, user-friendly portal. This repository is designed to facilitate and stimulate further research and development efforts aimed at combating the pervasive issue of fake news.

Paper Structure (61 sections, 2 tables)