Detection and Discovery of Misinformation Sources using Attributed Webgraphs

Peter Carragher, Evan M. Williams, Kathleen M. Carley

TL;DR

The paper addresses the problem of transient misinformation sources by shifting from article- or social-media-based signals to domain-level reliability prediction using attributed webgraphs and SEO features. It introduces MBFC*, a new multi-source labeled webgraph dataset, and applies graph neural networks to predict reliability and political bias, achieving an F1 score of 0.96 on the PoliticalNews benchmark. The results show that outlink structure and SEO context are strongly predictive and surpass the prior state of the art on key tasks. The paper also presents a content-agnostic, graph-based discovery pipeline that identifies candidate misinformation domains with substantial reliability and bias signals, while acknowledging limitations such as seed bias and domain survivability. The approach enables scalable, language- and content-agnostic misinformation research, with practical implications for detection and platform-level moderation.

Abstract

Website reliability labels underpin almost all research in misinformation detection. However, misinformation sources often exhibit transient behavior, which makes many such labeled lists obsolete over time. We demonstrate that Search Engine Optimization (SEO) attributes provide strong signals for predicting news site reliability. We introduce a novel attributed webgraph dataset with labeled news domains and their connections to outlinking and backlinking domains. We demonstrate the success of graph neural networks in detecting news site reliability using these attributed webgraphs, and show that our baseline news site reliability classifier outperforms current SoTA methods on the PoliticalNews dataset, achieving an F1 score of 0.96. Finally, we introduce and evaluate a novel graph-based algorithm for discovering previously unknown misinformation news sources.
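
To make the idea of a graph neural network over an attributed webgraph concrete, here is a minimal sketch of a single graph-convolution step in plain Python. It uses the standard GCN propagation rule, ReLU(D^-1/2 (A + I) D^-1/2 X W). It is not the authors' implementation: the three-domain graph, the two SEO-style feature columns, and the weight matrix are all made-up illustrative values, and the function names are hypothetical.

```python
import math

def normalize_adjacency(adj):
    """Return D^-1/2 (A + I) D^-1/2 for a dense, symmetric adjacency matrix."""
    n = len(adj)
    # Add self-loops so each domain keeps its own attributes.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    """Dense matrix product for lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def gcn_layer(adj, features, weights):
    """One propagation step: ReLU(A_norm @ X @ W)."""
    a_norm = normalize_adjacency(adj)
    h = matmul(matmul(a_norm, features), weights)
    return [[max(0.0, v) for v in row] for row in h]

# Toy webgraph (assumption): domain 1 shares links with domains 0 and 2.
adj = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
]

# Hypothetical SEO-style attributes per domain, e.g. a scaled rating
# and a scaled backlink count. These are illustrative numbers only.
features = [
    [0.9, 0.8],
    [0.5, 0.4],
    [0.1, 0.2],
]

# Illustrative weights; in practice they are learned from labeled domains.
weights = [
    [1.0, -1.0],
    [0.5, 0.5],
]

a_norm = normalize_adjacency(adj)
hidden = gcn_layer(adj, features, weights)

for domain_id, row in enumerate(hidden):
    print(domain_id, [round(v, 3) for v in row])
```

Each output row mixes a domain's own attributes with those of its linked neighbours, which is how link structure and SEO context can jointly inform a reliability prediction. A real model would stack several such layers and train the weights against reliability labels.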

Paper Structure (39 sections, 1 equation, 9 figures, 7 tables, 1 algorithm)

Figures (9)

  • Figure 1: Bias label counts grouped by reliability.
  • Figure 2: Backlink network where node colors show reliability labels; red are low reliability, blue are high reliability, pink are mixed reliability, grey are backlinking domains.
  • Figure 3: F1 scores for webgraph design space exploration, using GCN with various network structure and link weighting approaches.
  • Figure 4: Increasing backlink context improves the performance of our models, to a point.
  • Figure 5: SEO attribute importances from Ahrefs on predicting reliability labels on the PoliticalNews and MBFC* datasets. For a full description of features used, see https://ahrefs.com/api/documentation/metrics-extended.
  • ...and 4 more figures