Fake News Detection via Wisdom of Synthetic & Representative Crowds

François t'Serstevens, Roberto Cerina, Giulia Piccillo

TL;DR

The paper tackles the democratic legitimacy gap in fake-news detection by combining wisdom of crowds with hierarchical Bayesian modeling and post-stratification to produce population-representative veracity scores and state-level estimates of fake-news sharing. It contrasts naive crowd estimates with model-based estimates that learn from crowd demographics and tweet context via an ordinal logistic model, then post-stratifies to population personae using MrP, yielding metrics such as model-population and model-balance that converge in interpretation. The results show that fake-news sharing is generally rare, with partisan patterns that depend on how fake news is defined: Democrats are less likely to share under the standard metrics, whereas Republicans are less likely to share when Republican definitions are used. State-level heterogeneity is small but statistically meaningful. The approach provides a scalable, transparent framework for democratic fake-news moderation, offering population- and state-level insights and highlighting the value of incorporating uncertainty and representativeness in crowd-based assessments.
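
As a rough illustration of the ordinal logistic step mentioned above, the sketch below computes category probabilities for a generic ordered-logit model. The function name, the linear predictor value, and the cutpoints are hypothetical and chosen only for illustration; the paper's actual model specification (predictors, priors, number of response categories) is not reproduced here.

```python
import math


def ordinal_logit_probs(eta, cutpoints):
    """Return P(Y = k) for each ordered category under an ordered-logit model.

    eta       -- linear predictor for one rater/tweet pair (hypothetical value)
    cutpoints -- strictly increasing thresholds; K-1 cutpoints give K categories
    """
    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))

    # Cumulative probabilities P(Y <= k) = sigmoid(c_k - eta); the last is 1.
    cumulative = [sigmoid(c - eta) for c in cutpoints] + [1.0]
    # Category probabilities are successive differences of the cumulative ones.
    probs = [cumulative[0]]
    for k in range(1, len(cumulative)):
        probs.append(cumulative[k] - cumulative[k - 1])
    return probs


# Hypothetical example: four ordered veracity ratings, three cutpoints.
probs = ordinal_logit_probs(eta=0.4, cutpoints=[-1.0, 0.0, 1.5])
print(probs)
```

A larger linear predictor shifts probability mass towards the higher categories, which is how rater demographics and tweet context would move the predicted veracity rating in a model of this kind.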

Abstract

Social media companies have struggled to provide a democratically legitimate definition of "Fake News". Reliance on expert judgment has attracted criticism due to a general trust deficit and political polarisation. Approaches reliant on the "wisdom of the crowds" are a cost-effective, transparent and inclusive alternative. This paper provides a novel end-to-end methodology to detect fake news on X via "wisdom of the synthetic & representative crowds". We deploy an online survey on the Lucid platform to gather veracity assessments for a number of pandemic-related tweets from crowd-workers. Borrowing from the MrP literature, we train a Hierarchical Bayesian model to predict the veracity of each tweet from the perspective of different personae from the population of interest. We then weight the predicted veracity assessments according to a representative stratification frame, such that decisions about "fake" tweets are representative of the overall polity of interest. Based on these aggregated scores, we analyse a corpus of tweets and perform a second MrP to generate state-level estimates of the number of people who share fake news. We find small but statistically meaningful heterogeneity in fake news sharing across US states. At the individual level: i. sharing fake news is generally rare, with an average sharing probability interval [0.07,0.14]; ii. strong evidence that Democrats share less fake news, accounting for a reduction in the sharing odds of [57.3%,3.9%] relative to the average user; iii. when Republican definitions of fake news are used, it is the latter who show a decrease in the propensity to share fake news worth [50.8%, 2.0%]; iv. some evidence that women share less fake news than men, an effect worth a [29.5%,4.9%] decrease.
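
The weighting step described in the abstract can be sketched as a population-weighted average over personae. The cells, predicted probabilities, and population counts below are invented for illustration and are not taken from the paper; in the paper the per-persona predictions come from the fitted Hierarchical Bayesian model, and the stratification frame is far finer than this two-by-two example.

```python
def poststratify(cell_predictions, cell_counts):
    """Population-weighted average of per-cell predictions.

    cell_predictions -- predicted probability that a persona rates the tweet
                        as true, keyed by cell (hypothetical values)
    cell_counts      -- population count of each cell in the stratification
                        frame (hypothetical values)
    """
    total = sum(cell_counts.values())
    weighted = sum(cell_predictions[cell] * n for cell, n in cell_counts.items())
    return weighted / total


# Hypothetical personae: (party, gender) cells for a single tweet.
predicted_true = {
    ("Democrat", "F"): 0.2,
    ("Democrat", "M"): 0.3,
    ("Republican", "F"): 0.6,
    ("Republican", "M"): 0.7,
}
# Hypothetical stratification frame (population counts per cell).
population = {
    ("Democrat", "F"): 300,
    ("Democrat", "M"): 200,
    ("Republican", "F"): 250,
    ("Republican", "M"): 250,
}

score = poststratify(predicted_true, population)
# Unweighted mean over cells, shown only for comparison.
naive = sum(predicted_true.values()) / len(predicted_true)
print(score, naive)
```

The post-stratified score differs from the unweighted mean whenever the cells are not equally sized, which is the sense in which the aggregated veracity score is meant to be representative of the polity rather than of the survey sample.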

Paper Structure

This paper contains 22 sections, 12 equations, 6 figures, 2 tables, 1 algorithm.

Figures (6)

  • Figure 1: The y-axis represents the census / ground-truth proportions and the x-axis represents the Lucid and $\mathbb{X}$ sample proportions. Any point on the blue line represents a matching proportion. Points to the left/right of the blue line are under-/over-represented in the sample.
  • Figure 2: Difference (in logit-scale predicted values) between Democrat and Republican predispositions towards the selected context annotations. Negative/positive values indicate that Republicans/Democrats, respectively, are more likely to rate the topic as 'true'. Note that most tweets tend to have negative sentiment, hence the partisan divide is often driven by the belief that negative facts about a political opponent are true.
  • Figure 3: The upper half of the matrix shows the correlation coefficients; the lower half shows the p-values of the respective correlations. The yellow rectangle highlights the wisdom of the crowd metrics; non-highlighted metrics are unrepresentative of the crowd by design.
  • Figure 4: Posterior distribution of the party random-effects on fake news sharing estimation. Negative values indicate a reduced likelihood of posting fake news. The legends identify the veracity score used to estimate a given effect.
  • Figure 5: Posterior distribution of state-level % of individuals who share fake news.
  • ...and 1 more figure