Community Notes are Vulnerable to Rater Bias and Manipulation

Bao Tran Truong, Siqi Wu, Alessandro Flammini, Filippo Menczer, Alexander J. Stewart

TL;DR

Community-driven moderation may scale well, but its vulnerability to rater bias and manipulation raises concerns about reliability and trustworthiness, and points to the need for better mechanisms to safeguard the integrity of crowdsourced fact-checking.

Abstract

Social media platforms increasingly rely on crowdsourced moderation systems like Community Notes to combat misinformation at scale. However, these systems face challenges from rater bias and potential manipulation, which may undermine their effectiveness. Here we systematically evaluate the Community Notes algorithm using simulated data that models realistic rater and note behaviors, quantifying error rates in publishing helpful versus unhelpful notes. We find that the algorithm suppresses a substantial fraction of genuinely helpful notes and is highly sensitive to rater biases, including polarization and in-group preferences. Moreover, a small minority (5-20%) of bad raters can strategically suppress targeted helpful notes, effectively censoring reliable information. These findings suggest that while community-driven moderation may offer scalability, its vulnerability to bias and manipulation raises concerns about reliability and trustworthiness, highlighting the need for improved mechanisms to safeguard the integrity of crowdsourced fact-checking.

Paper Structure

This paper contains 10 sections, 2 equations, 19 figures, 1 table.

Figures (19)

  • Figure 1: Experimental design. We simulate raters and notes using the same assumptions as the Community Notes algorithm. Each note $n$ has two true parameters: helpfulness $i_n$ and bias $f_n$. Similarly, each rater $u$ is characterized by true friendliness $i_u$ and bias $f_u$. These parameters are drawn from distributions that reflect real-world features of the populations that write and rate notes, such as rater polarization and opinion diversity. Raters assign ratings to notes based on the helpfulness and bias of the note, as well as on their own friendliness and bias. To study the effects of manipulation, we distinguish between "good" and "bad" raters. Good raters behave as intended by the Community Notes algorithm: they assign higher ratings to notes they perceive as helpful. In contrast, a minority of bad raters intentionally down-rate helpful notes. The Community Notes algorithm uses the simulated ratings we generate to estimate inferred parameters for notes, $\hat{i}_n$ and $\hat{f}_n$, and raters, $\hat{i}_u$ and $\hat{f}_u$. It then uses the inferred note parameters to decide whether a note should be published. In our evaluation, we compare the algorithm's publication decisions using inferred parameters to those based on the true parameters ($i_n$ and $f_n$), representing the ideal case of perfect parameter inference.
  • Figure 2: Varying the distributions of raters and notes. We systematically varied the underlying distributions of note parameters, $i_n$ and $f_n$, and rater parameters, $i_u$ and $f_u$, in populations where all raters are honest. (a) Effects of rater polarization $\rho_u$ with unpolarized notes. (b) Effects of variability of rater friendliness, as measured by standard deviation $\sigma^I_u$, in unpolarized populations. (c) Effects of note polarization $\rho_n$ with unpolarized raters. (d) Effects of in-group bias $E_h$ when choosing which notes to rate in unpolarized populations. Plots show the mean and standard errors across 50 replicate datasets of 20,000 notes and 10,750 raters, with the distribution of ratings chosen to replicate the real-world distribution (see Methods). In all cases we set $\mu^I_n = \mu^I_u = 0.25$ and $\sigma^I_n = 0.5$, which reproduces the empirical frequency of HELPFUL ratings.
  • Figure 3: Indiscriminate bad raters. We systematically varied the percentage of indiscriminate bad raters and the frequency with which they mis-rate notes they perceive as helpful. (a) The suppression rate remains relatively low until the percentage of bad raters reaches 10-15% and the frequency of bad rater behavior reaches around 0.8, after which it reaches 100%, indicating that all truly helpful notes go unpublished. (b) A similar pattern occurs for the pollution rate, which also reaches 100%, indicating that none of the published notes are truly helpful. (c) The helpfulness filter tends to successfully remove bad raters when there are sufficiently few of them ($<10\%$) and/or the frequency of bad rater behavior is sufficiently low ($<0.8$). However, the filter breaks down at the same point at which the suppression and pollution rates approach 100%. Results shown are for a single replicate with 20,000 notes and 10,750 raters, for each combination of bad rater percentage and frequency of bad rater behavior. Parameters are as described in Figure 2 and the main text, for unpolarized populations ($\rho_n=\rho_u=0$), and with no in-group bias ($E_h=0$).
  • Figure 4: Coordinated bad raters. We systematically varied the percentage of coordinated bad raters, who mis-rate notes they perceive as helpful only if those notes belong to a targeted group. Here the frequency of bad rater behavior is set to one (we explore the more general case in Supplementary Material). We randomly set the targeted group to be either notes with $f_n<0$ or notes with $f_n\geq0$ for each replicate dataset. (a) The suppression rates for targeted and non-targeted notes diverge once the percentage of bad raters reaches approximately 5%, and the suppression rate for targeted notes reaches 100% once the percentage of bad raters reaches approximately 20%, while the suppression rate of non-targeted notes remains unchanged. (b) The pollution rates for targeted and non-targeted notes diverge once the percentage of bad raters reaches approximately 15%, and the pollution rate for targeted notes reaches 100% once the percentage of bad raters reaches approximately 25%, while the pollution rate for non-targeted notes again remains largely unchanged. (c) The publication rate of targeted notes declines from around 20% when there are no bad raters to around 1% when bad raters exceed 20%, while the publication rate of non-targeted notes remains largely unchanged. Results shown are the mean and standard error for 100 replicates with 20,000 notes and 10,750 raters, for each choice of bad rater percentage. Parameters are as described in Figure 2 and the main text, for unpolarized populations ($\rho_n=\rho_u=0$) and with no in-group bias ($E_h=0$).
  • Figure 5: Interaction between rater bias and critical percentage of bad raters. We calculated the percentage of coordinated bad raters required to raise the suppression rate (yellow) and pollution rate (pink) to 90% under different experimental conditions. The frequency of bad rater behavior is set to one. For all raters, we considered cases with out-group bias ($E_h=-1$), in-group bias ($E_h=1$), and neither ($E_h=0$). We combined these cases with the presence of rater polarization ($\rho_u=1$) or its absence ($\rho_u=0$). Results shown are the means and standard errors across 100 replicates with 20,000 notes and 10,750 raters. Parameters are as described in Figure 2 and the main text, for populations without note polarization ($\rho_n=0$).
  • ...and 14 more figures
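
The captions above describe three ingredients: how a rater perceives a note, how bad raters deviate, and how publication errors are scored. The Python sketch below illustrates them under loud assumptions and is not the authors' simulation code. The additive score `mu + i_u + i_n + f_u * f_n` follows the publicly documented Community Notes rating model. The binary HELPFUL/not-HELPFUL rating, the `threshold` of 0.5, and every name (`Note`, `Rater`, `rate`, `suppression_rate`, `pollution_rate`) are hypothetical choices made for illustration. The two error rates follow the caption wording: suppression is the share of truly helpful notes that go unpublished, and pollution is the share of published notes that are not truly helpful.

```python
import random
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Note:
    helpfulness: float  # true i_n
    bias: float         # true f_n


@dataclass
class Rater:
    friendliness: float  # true i_u
    bias: float          # true f_u
    bad: bool = False
    misrate_freq: float = 1.0  # frequency of bad rater behavior
    # None means indiscriminate; otherwise a predicate selecting targeted notes.
    target: Optional[Callable[[Note], bool]] = None


def perceived_score(rater: Rater, note: Note, mu: float = 0.0) -> float:
    """Additive score: global mean, both intercepts, and bias alignment."""
    return mu + rater.friendliness + note.helpfulness + rater.bias * note.bias


def rate(rater: Rater, note: Note, threshold: float = 0.5,
         rng: Optional[random.Random] = None) -> bool:
    """Return True for a HELPFUL rating (binary simplification, assumed)."""
    helpful = perceived_score(rater, note) >= threshold
    if rater.bad and helpful:
        draw = rng.random() if rng is not None else random.random()
        targeted = rater.target is None or rater.target(note)
        if targeted and draw < rater.misrate_freq:
            return False  # bad rater down-rates a note it perceives as helpful
    return helpful


def suppression_rate(truly_helpful: Sequence[bool],
                     published: Sequence[bool]) -> float:
    """Share of truly helpful notes that were not published."""
    outcomes = [p for h, p in zip(truly_helpful, published) if h]
    if not outcomes:
        return 0.0  # convention assumed here when no note is truly helpful
    return sum(1 for p in outcomes if not p) / len(outcomes)


def pollution_rate(truly_helpful: Sequence[bool],
                   published: Sequence[bool]) -> float:
    """Share of published notes that are not truly helpful."""
    shown = [h for h, p in zip(truly_helpful, published) if p]
    if not shown:
        return 0.0  # convention assumed here when nothing is published
    return sum(1 for h in shown if not h) / len(shown)
```

As a usage example, a coordinated bad rater in the sense of Figure 4 could be written as `Rater(0.2, 0.0, bad=True, target=lambda n: n.bias < 0)`. It rates non-targeted notes like a good rater and down-rates targeted notes that it perceives as helpful. The `published` flags passed to the two rate functions would come from whatever publication rule is being evaluated, which this sketch deliberately leaves out.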