
"Liar, Liar Pants on Fire": A New Benchmark Dataset for Fake News Detection

William Yang Wang

TL;DR

This paper introduces LIAR, a large, publicly available dataset of 12.8K PolitiFact statements with six fine-grained truth labels and rich metadata, addressing the scarcity of labeled fake-news resources. It frames fake-news detection as a 6-way multiclass task and proposes a hybrid CNN that fuses text with metadata embeddings to improve on text-only baselines. Empirical results show that incorporating metadata yields modest yet consistent gains, with the best test accuracy around 0.274, demonstrating the value of contextual information for fine-grained deception detection. The dataset supports automatic fact-checking research and broader political NLP tasks such as stance classification and rumor detection, with implications for real-world misinformation mitigation.

Abstract

Automatic fake news detection is a challenging problem in deception detection, and it has tremendous real-world political and social impacts. However, statistical approaches to combating fake news have been dramatically limited by the lack of labeled benchmark datasets. In this paper, we present LIAR: a new, publicly available dataset for fake news detection. We collected 12.8K manually labeled short statements, spanning a decade and various contexts, from PolitiFact.com, which provides a detailed analysis report and links to source documents for each case. This dataset can be used for fact-checking research as well. Notably, this new dataset is an order of magnitude larger than the previously largest public fake news datasets of similar type. Empirically, we investigate automatic fake news detection based on surface-level linguistic patterns. We have designed a novel, hybrid convolutional neural network to integrate meta-data with text. We show that this hybrid approach can improve a text-only deep learning model.
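
To make the idea of fusing text with meta-data concrete, here is a minimal, untrained sketch in plain Python. It is not the paper's architecture: it assumes a single convolution width, one meta-data field per statement, random weights, and simple concatenation before a 6-way softmax. The class name `HybridSketch`, its parameters, and the toy vocabulary are all hypothetical. The six label names are not listed in the text above; they are the ones commonly associated with the dataset.

```python
import math
import random

# Six truth labels (assumed naming; the summary above only says "six labels").
LABELS = ["pants-fire", "false", "barely-true", "half-true", "mostly-true", "true"]


def rand_vec(rng, dim):
    """Small random vector standing in for a learned parameter."""
    return [rng.uniform(-0.5, 0.5) for _ in range(dim)]


def conv_max_pool(embedded, filters):
    """1-D convolution over token embeddings, ReLU, then max-over-time pooling."""
    pooled = []
    for filt in filters:  # filt holds one weight vector per window position
        width = len(filt)
        best = 0.0  # ReLU output is never below zero
        for start in range(len(embedded) - width + 1):
            window = embedded[start:start + width]
            act = sum(w * x for wv, xv in zip(filt, window) for w, x in zip(wv, xv))
            best = max(best, act)
        pooled.append(best)
    return pooled


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class HybridSketch:
    """Hypothetical text + meta-data classifier with random (untrained) weights."""

    def __init__(self, vocab, meta_values, emb_dim=8, n_filters=4, width=2, seed=0):
        rng = random.Random(seed)
        self.emb_dim = emb_dim
        self.width = width
        self.word_emb = {w: rand_vec(rng, emb_dim) for w in vocab}
        self.meta_emb = {m: rand_vec(rng, emb_dim) for m in meta_values}
        self.filters = [[rand_vec(rng, emb_dim) for _ in range(width)]
                        for _ in range(n_filters)]
        # Output layer sees pooled text features concatenated with the meta vector.
        self.out = [rand_vec(rng, n_filters + emb_dim) for _ in LABELS]

    def predict_proba(self, tokens, meta):
        zero = [0.0] * self.emb_dim
        embedded = [self.word_emb.get(t, zero) for t in tokens]
        # Pad very short statements so at least one convolution window exists.
        embedded += [zero] * max(0, self.width - len(embedded))
        text_feat = conv_max_pool(embedded, self.filters)
        meta_feat = self.meta_emb.get(meta, zero)
        fused = text_feat + meta_feat  # list concatenation = feature fusion
        scores = [sum(w * x for w, x in zip(row, fused)) for row in self.out]
        return softmax(scores)


model = HybridSketch(vocab=["taxes", "went", "up"], meta_values=["senator", "governor"])
probs = model.predict_proba(["taxes", "went", "up"], "senator")
print(dict(zip(LABELS, (round(p, 3) for p in probs))))
```

Because the weights are random, the printed distribution is meaningless; the point is only the data flow, in which text features and a meta-data embedding are joined before the 6-way classification.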

Paper Structure

This paper contains 7 sections, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Some random excerpts from the LIAR dataset.
  • Figure 2: The proposed hybrid Convolutional Neural Networks framework for integrating text and meta-data.