Auditing the Auditors: Does Community-based Moderation Get It Right?

Yeganeh Alimohammadi; Karissa Huang; Christian Borgs; Jennifer Chayes

Abstract

Online social platforms increasingly rely on crowd-sourced systems to label misleading content at scale, but these systems must both aggregate users' evaluations and decide whose evaluations to trust. To address the latter, many platforms audit users by rewarding agreement with the final aggregate outcome, a design we term consensus-based auditing. We analyze the consequences of this design in X's Community Notes, which in September 2022 adopted consensus-based auditing that ties users' eligibility for participation to agreement with the eventual platform outcome. We find evidence of strategic conformity: minority contributors' evaluations drift toward the majority and their participation share falls on controversial topics, where independent signals matter most. We formalize this mechanism in a behavioral model in which contributors trade off private beliefs against anticipated penalties for disagreement. Motivated by these findings, we propose a two-stage auditing and aggregation algorithm that weights contributors by the stability of their past residuals rather than by agreement with the majority. The method first accounts for differences across content and contributors, and then measures how predictable each contributor's evaluations are relative to the latent-factor model. Contributors whose evaluations are consistently informative receive greater influence in aggregation, even when they disagree with the prevailing consensus. In the Community Notes data, this approach improves out-of-sample predictive performance while avoiding penalization of disagreement.
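The abstract describes the two-stage procedure only at a high level. The sketch below is a minimal illustration under our own simplifying assumptions, not the paper's implementation. Stage one fits a two-way additive model $r_{un} \approx \mu + a_u + b_n$, a plain stand-in for the latent-factor model (a rank-1 factor term could be added). Stage two scores each contributor's stability as the inverse variance of their residuals, and the note scores are then refit with those weights. The function names (`fit_additive`, `stability_weights`, `two_stage`), the inverse-variance rule, and the smoothing constant `eps` are hypothetical. For brevity the residuals come from the same batch of ratings rather than from a contributor's past ratings.

```python
from collections import defaultdict
from statistics import fmean, pvariance


def fit_additive(ratings, weights=None, sweeps=50):
    """Fit r[u, n] ~ mu + a[u] + b[n] by alternating weighted least squares.

    ratings: list of (user, note, value) triples.
    weights: optional dict user -> positive weight (missing users default to 1.0).
    Hypothetical stand-in for the latent-factor stage described in the abstract.
    """
    wt = weights or {}
    users = {u for u, _, _ in ratings}
    notes = {n for _, n, _ in ratings}
    mu = fmean(r for _, _, r in ratings)
    a = {u: 0.0 for u in users}
    b = {n: 0.0 for n in notes}
    for _ in range(sweeps):
        # Note step: weighted mean of ratings after removing mu and rater effects.
        num, den = defaultdict(float), defaultdict(float)
        for u, n, r in ratings:
            num[n] += wt.get(u, 1.0) * (r - mu - a[u])
            den[n] += wt.get(u, 1.0)
        b = {n: num[n] / den[n] for n in notes}
        # Rater step: a rater's weight is constant across their ratings, so it cancels.
        num, den = defaultdict(float), defaultdict(float)
        for u, n, r in ratings:
            num[u] += r - mu - b[n]
            den[u] += 1.0
        a = {u: num[u] / den[u] for u in users}
    return mu, a, b


def stability_weights(ratings, mu, a, b, eps=1e-3):
    """Weight each rater by the inverse variance of their residuals (hypothetical rule)."""
    resid = defaultdict(list)
    for u, n, r in ratings:
        resid[u].append(r - mu - a[u] - b[n])
    return {u: 1.0 / (pvariance(rs) + eps) for u, rs in resid.items()}


def two_stage(ratings, eps=1e-3):
    """Stage 1: unweighted fit and residual-based weights. Stage 2: weighted refit of note scores."""
    mu, a, b = fit_additive(ratings)
    w = stability_weights(ratings, mu, a, b, eps)
    _, _, b_weighted = fit_additive(ratings, weights=w)
    return b_weighted
```

A contributor who disagrees with the others by a constant offset is absorbed into their own intercept $a_u$, so their residuals stay small and their weight stays high; only erratic contributors are down-weighted. This is the sense in which the sketch avoids penalizing disagreement itself.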

Paper Structure

This paper contains 40 sections, 17 theorems, 213 equations, 17 figures, and 11 tables.

Guide to the Appendix
Data, reconstruction, and empirical methodology
Data sources
Reconstructing Weekly Latent Factors
Policy Timing, and User Cohorts
Additional Empirical Results
Robustness Checks for Minority Behavior Shift
Latent Factor Distribution Shift
RDD Tests for Bimodality
Regression discontinuity design.
Results and interpretation.
Additional Alignment Tests
Logistic Regression
DiD for Note Helpfulness
Controversial Content and Participation

Key Result

Theorem 1

Assume $U,N\to\infty$ and that $\mathbb{E}[f_u] = \mu_f$, where $\mu_f$ is known. Then the estimate of note helpfulness is consistent. That is, in the truthful regime, rank-1 matrix factorization (MF) recovers the true note helpfulness.
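The theorem statement leaves the rating model implicit. As an illustration only, assume the form used by the open-source Community Notes scorer: a global mean $\mu$, a rater intercept $i_u$, a note intercept $i_n$ read as note helpfulness, and rank-1 factors $f_u$ (rater) and $f_n$ (note). The symbols other than $f_u$ and $\mu_f$ are our assumptions. Splitting $f_u$ around its mean shows why $\mu_f$ must be known:

```latex
\begin{aligned}
r_{un} &= \mu + i_u + i_n + f_u f_n + \varepsilon_{un} \\
       &= \mu + i_u + \bigl(i_n + \mu_f f_n\bigr) + (f_u - \mu_f)\, f_n + \varepsilon_{un}.
\end{aligned}
```

Shifting every rater factor by a constant $c$ and every note intercept by $-c f_n$ leaves all fitted ratings unchanged, so with centered rater factors the data pin down only $i_n + \mu_f f_n$. Fixing $\mathbb{E}[f_u] = \mu_f$ removes this ambiguity and separates $i_n$ from the factor term.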

Figures (17)

Figure 1: Rater factor change
Figure 2: Note factor change
Figure 4: Rolling Spearman correlation between rater-note factor dot product alignment and helpfulness ratings for the cohort of 1,202 early users, computed over a sliding window of 50 ratings sorted by date. The red dashed line marks the algorithmic change on October 1, 2022. The bold line shows a LOWESS smooth of the rolling correlation. Prior to the intervention, the correlation is relatively stable, whereas after the intervention, the correlation declines steadily. This is an observational indicator that the rollout of Rating Impact weakened the relationship between rater-note alignment and helpfulness ratings among the group of users who were active before the change.
Figure 5: Pre–post change in the share of notes with final status Helpful by controversy category around the Rating Impact rollout (cutoff: 2022-10-01). Bars show the mean proportion in the approximately 20 weeks before (Pre, blue) and after (Post, orange) the cutoff; error bars are 95% CIs. Text above bars reports the Post–Pre difference in percentage points. The increase is larger for non-controversial notes (+13.7 pp) than for controversial notes (+6.5 pp).
Figure 6: Weekly mean squared error (MSE) of in-sample vs. out-of-sample predictions from the matrix factorization model. The MSE is the mean of the squared differences between the observed rating outcomes and the model’s predicted rating outcomes (pre-discretization). In-sample errors reflect fit to the same week’s ratings, while out-of-sample errors use factors estimated from week $t$ to predict ratings in week $t+1$. The vertical dashed line marks the Rating Impact analysis date we use.
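The rolling statistic in Figure 4 can be sketched in plain Python as below. This is a minimal illustration, not the authors' code: the record layout `(date, rater_factor, note_factor, rating)` and the function names are hypothetical, ties receive average ranks, each window is labeled by the date of its last rating, and the LOWESS smoothing step is left out.

```python
from statistics import fmean


def average_ranks(values):
    """1-based ranks; tied values receive the mean of the positions they span."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when a window has no variation
    return sxy / (sxx * syy) ** 0.5


def spearman(x, y):
    """Spearman's rho as the Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def rolling_spearman(records, window=50):
    """records: (date, rater_factor, note_factor, rating) tuples.

    Sorts by date, computes alignment as the dot product of the two factor
    vectors, and returns (date of last rating in window, rho) for each window.
    """
    records = sorted(records, key=lambda rec: rec[0])
    align = [sum(p * q for p, q in zip(fu, fn)) for _, fu, fn, _ in records]
    rating = [rec[3] for rec in records]
    out = []
    for end in range(window, len(records) + 1):
        rho = spearman(align[end - window:end], rating[end - window:end])
        out.append((records[end - 1][0], rho))
    return out
```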
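The pre/post arithmetic behind Figure 5 can be sketched as follows, run once per controversy category. This is a minimal sketch under stated assumptions: each note is assigned to Pre or Post by a single date, the window is exactly 20 weeks on each side (the caption says approximately 20), and the 95% intervals use the normal-approximation (Wald) formula, which the caption does not specify. The function names are hypothetical.

```python
from datetime import date, timedelta
from math import sqrt


def split_pre_post(notes, cutoff=date(2022, 10, 1), weeks=20):
    """notes: (date, is_helpful) pairs. Keep notes within `weeks` of the cutoff."""
    lo, hi = cutoff - timedelta(weeks=weeks), cutoff + timedelta(weeks=weeks)
    pre = [helpful for d, helpful in notes if lo <= d < cutoff]
    post = [helpful for d, helpful in notes if cutoff <= d < hi]
    return pre, post


def share_with_ci(flags, z=1.96):
    """Share of True values with a normal-approximation (Wald) 95% interval."""
    n = len(flags)
    p = sum(flags) / n
    half = z * sqrt(p * (1 - p) / n)
    return p, (p - half, p + half)


def pre_post_change_pp(notes, **kwargs):
    """Post minus Pre share of Helpful notes, in percentage points."""
    pre, post = split_pre_post(notes, **kwargs)
    return 100 * (share_with_ci(post)[0] - share_with_ci(pre)[0])
```

Calling `pre_post_change_pp` separately on the controversial and non-controversial subsets yields the two differences that the figure annotates above its bars.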
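The in-sample versus out-of-sample comparison in Figure 6 can be sketched as below, assuming integer week indices and a model that predicts a rating for a (rater, note) pair it has seen. `fit_note_means` is a deliberately trivial, hypothetical stand-in for the matrix factorization model so that the example runs. Out-of-sample error is computed only on week $t+1$ pairs that the week-$t$ model covers; that handling of new raters and notes is our assumption.

```python
from collections import defaultdict
from statistics import fmean


def mse(pairs):
    """Mean of squared (observed - predicted) differences."""
    return fmean((obs - pred) ** 2 for obs, pred in pairs)


def fit_note_means(ratings):
    """Hypothetical stand-in for the MF model: predict each note's mean rating."""
    sums, counts = defaultdict(float), defaultdict(int)
    for _, note, value in ratings:
        sums[note] += value
        counts[note] += 1
    return {note: sums[note] / counts[note] for note in sums}


def predict_note_mean(model, user, note):
    return model.get(note)  # None when the note was not seen in the fitting week


def weekly_mse(ratings_by_week, fit=fit_note_means, predict=predict_note_mean):
    """ratings_by_week: dict int week -> list of (user, note, value).

    Returns week t -> (in-sample MSE on week t, out-of-sample MSE on week t+1
    using the week-t model, or None if week t+1 is absent or uncovered).
    """
    result = {}
    for t, ratings in sorted(ratings_by_week.items()):
        model = fit(ratings)
        in_mse = mse((r, predict(model, u, n)) for u, n, r in ratings)
        nxt = [(r, predict(model, u, n)) for u, n, r in ratings_by_week.get(t + 1, [])]
        covered = [(r, p) for r, p in nxt if p is not None]
        result[t] = (in_mse, mse(covered) if covered else None)
    return result
```

Swapping in a real factorization only requires a different `fit`/`predict` pair; the week-$t$ to week-$t+1$ bookkeeping stays the same.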

Theorems & Definitions (35)

Definition 1
Definition 2: User's Utility
Theorem 1
Theorem 2
Theorem 3
Proposition 4
Proposition 5
Remark 6
Theorem 7
Lemma E.1
