Hoaxpedia: A Unified Wikipedia Hoax Articles Dataset

Hsuvas Borkakoty, Luis Espinosa-Anke

TL;DR

Hoaxpedia is a text-centric benchmark for detecting Wikipedia hoaxes that unifies 311 confirmed hoax articles with about 30,000 semantically similar legitimate articles. The study systematically compares surface-level features and shows that, while these cues are similar across hoax and legitimate articles, revision-history signals discriminate between the two more strongly. Experiments across BERT-family models, Longformer, T5, and large language models show that full-text content generally yields higher performance than the first sentence (the definition) alone, with Longformer achieving around 0.8 F1 in full-text settings and RoBERTa-based models performing consistently in definition-only setups. The work provides a practical dataset and benchmarks for text-based disinformation detection on Wikipedia, and suggests future work on editor-based signals and on balanced training to handle data imbalance.

Abstract

Hoaxes are a recognised form of deliberately created disinformation, with potentially serious implications for the credibility of reference knowledge resources such as Wikipedia. What makes detecting Wikipedia hoaxes hard is that they are often written according to the official style guidelines. In this work, we first provide a systematic analysis of similarities and discrepancies between legitimate and hoax Wikipedia articles, and introduce Hoaxpedia, a collection of 311 hoax articles (from existing literature and official Wikipedia lists) together with semantically similar legitimate articles, which together form a binary text classification dataset aimed at fostering research in automated hoax detection. We report results after analyzing several language models, hoax-to-legit ratios, and the amount of text classifiers are exposed to (the full article vs. the article's definition alone). Our results suggest that detecting deceitful content in Wikipedia based on content alone is hard but feasible. We complement our analysis with a study of the differences in the distributions of edit histories, and find that this feature yields better classification results than content.
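
The experimental setup described above (a binary dataset evaluated at several hoax-to-legit ratios) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names `subsample_to_ratio` and `f1_positive`, the label convention (1 = hoax, 0 = legitimate), and the ratio of 10 legitimate articles per hoax are hypothetical and not taken from the paper, and the summary does not state which F1 variant the authors report. The sketch scores the hoax class only, which is one common choice for imbalanced data.

```python
import random


def subsample_to_ratio(hoax, legit, legit_per_hoax, seed=0):
    """Build (text, label) pairs with up to `legit_per_hoax` legitimate
    articles per hoax article. Labels: 1 = hoax, 0 = legitimate."""
    rng = random.Random(seed)
    # Cap at the number of legitimate articles actually available.
    n_legit = min(len(legit), legit_per_hoax * len(hoax))
    sampled = rng.sample(legit, n_legit)
    data = [(text, 1) for text in hoax] + [(text, 0) for text in sampled]
    rng.shuffle(data)
    return data


def f1_positive(y_true, y_pred):
    """F1 score for the positive (hoax) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no hoax was correctly identified
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy placeholders standing in for article texts (not real data).
hoax_articles = ["hoax text A", "hoax text B"]
legit_articles = [f"legit text {i}" for i in range(50)]
dataset = subsample_to_ratio(hoax_articles, legit_articles, legit_per_hoax=10)
```

With a heavily imbalanced split, a classifier that always predicts "legitimate" reaches high accuracy but scores 0 under `f1_positive`, which is why a class-sensitive metric is the more informative one in this kind of setting.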

Paper Structure

This paper contains 20 sections, 2 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Example of a hoax and a legitimate (real) article.
  • Figure 2: Text length distribution for hoax and legitimate articles.
  • Figure 3: Results of different stylistic analyses on hoax (red) and legitimate (blue) articles.
  • Figure 4: Timeline-based dense region plots for hoax and legitimate articles (with start and end lines for each region and normalized densities marked in grey).
  • Figure 5: Histogram of the normalized distribution of the number of revisions in dense regions for hoax and legitimate (real) articles.
  • ...and 1 more figure