Specious Sites: Tracking the Spread and Sway of Spurious News Stories at Scale

Hans W. A. Hanley, Deepak Kumar, Zakir Durumeric

TL;DR

This work tackles the scalable tracking of misinformation narratives across 1,334 unreliable news websites and fringe forums. It introduces an NLP pipeline that fine-tunes MPNet with contrastive learning, embeds passages, and applies DP-Means-based clustering to identify 52,036 distinct narratives, so that the origin and spread of each narrative can be monitored from daily scrapes. The study analyzes narrative origins, amplification, and cross-platform dynamics (including 8kun and 4chan), and shows how surfacing emerging narratives before they peak could speed up fact-checking. By releasing code and data, the authors provide a practical tool for researchers and journalists to detect and respond to misinformation at scale.

Abstract

Misinformation, propaganda, and outright lies proliferate on the web, with some narratives having dangerous real-world consequences for public health, elections, and individual safety. However, despite the impact of misinformation, the research community largely lacks automated and programmatic approaches for tracking news narratives across online platforms. In this work, using daily scrapes of 1,334 unreliable news websites, the large language model MPNet, and DP-Means clustering, we introduce a system to automatically identify and track the narratives spread within online ecosystems. Identifying 52,036 narratives on these 1,334 websites, we describe the most prevalent narratives spread in 2022 and identify the most influential websites that originate and amplify narratives. Finally, we show how our system can be used to detect new narratives originating from unreliable news websites and to aid fact-checkers in more quickly addressing misinformation. We release code and data at https://github.com/hanshanley/specious-sites.
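
To make the clustering step more concrete, the following is a minimal sketch of a DP-Means-style loop over passage embeddings, written with cosine similarity. It is not the authors' released implementation: the function name `dp_means_cosine`, the toy 3-dimensional vectors standing in for MPNet embeddings, and the iteration cap are assumptions made for illustration. The 0.60 default is taken from the similarity threshold quoted in the Figure 3 caption.

```python
import numpy as np


def dp_means_cosine(embeddings, threshold=0.60, max_iters=10):
    """Hypothetical DP-Means-style clustering using cosine similarity.

    A point opens a new cluster when its best similarity to every
    existing center falls below `threshold`.
    """
    # Normalize rows so that dot products equal cosine similarities.
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

    # Standard DP-Means start: one cluster at the global mean.
    mean = x.mean(axis=0)
    norm = np.linalg.norm(mean)
    centers = [mean / norm if norm > 0 else x[0].copy()]
    labels = np.zeros(len(x), dtype=int)

    for _ in range(max_iters):
        previous = labels.copy()
        for i, v in enumerate(x):
            sims = np.array([c @ v for c in centers])
            best = int(sims.argmax())
            if sims[best] < threshold:
                # No center is similar enough: this point seeds a new cluster.
                centers.append(v.copy())
                best = len(centers) - 1
            labels[i] = best
        # Recompute each non-empty center as the normalized mean of its members.
        for k in range(len(centers)):
            members = x[labels == k]
            if len(members):
                m = members.mean(axis=0)
                n = np.linalg.norm(m)
                if n > 0:
                    centers[k] = m / n
        if np.array_equal(previous, labels):
            break

    # Drop empty clusters and relabel the rest as 0..K-1.
    used, labels = np.unique(labels, return_inverse=True)
    return labels, np.vstack(centers)[used]


# Toy stand-ins for passage embeddings: three loose groups of two vectors.
toy = np.array([
    [1.0, 0.0, 0.0], [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0], [0.0, 0.9, 0.1],
    [0.0, 0.0, 1.0], [0.1, 0.0, 0.9],
])
labels, centers = dp_means_cosine(toy)
print(labels, centers.shape)
```

Unlike k-means, the number of clusters is not fixed in advance; it grows whenever a passage is too dissimilar from every existing center, which is one reason this family of methods suits a corpus where new narratives keep appearing.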

Paper Structure (38 sections, 4 equations, 7 figures, 10 tables)

Figures (7)

  • Figure 1: Our pipeline for identifying and labeling narrative clusters from the daily publications of unreliable news websites.
  • Figure 2: Evaluation of our model's precision, recall, and $F_1$ scores on the English portion of the SemEval22 dataset [goel2022semeval], using 3.0 as the cut-off for two articles being about the same event [hanley2022partial].
  • Figure 3: Passage pair at our selected similarity threshold (0.60).
  • Figure 4: Article volume of popular narratives from January 1, 2022, to November 1, 2022.
  • Figure 5: Volume over time for case-study narratives of Ukrainian Nazis, Killer COVID-19 vaccines, and 2020 Election Denialism.
  • ...and 2 more figures