Seeing Like an AI: How LLMs Apply (and Misapply) Wikipedia Neutrality Norms

Joshua Ashkinaze; Ruijia Guan; Laura Kurek; Eytan Adar; Ceren Budak; Eric Gilbert

TL;DR

This study tests whether general-purpose LLMs can be steered to follow Wikipedia's Neutral Point of View (NPOV) policy using high-level rules alone, covering both detection of biased edits and generation of neutral rewrites. Across a multi-model, multi-prompt setup on the Wikipedia Neutrality Corpus, LLMs show limited success at detecting NPOV violations (best accuracy about 64% on a balanced dataset) and reveal model-specific priors about neutrality. Their generated rewrites have high recall but low precision relative to human editors: they remove most of the words editors removed, but also change much more. In human evaluation, crowdworkers preferred AI rewrites for neutrality and fluency, even though AI edits diverge from editor norms by adding extraneous content; qualitative analysis shows AI can be “NPOV+” (applying the policy more comprehensively than editors) but can also over-edit. The findings highlight tradeoffs for Wikipedia, model builders, and policy makers: LLMs can provide useful neutral drafts with human oversight and smarter prompting (e.g., retrieval-augmented generation, expert fine-tuning), but relying on them for automatic detection or to mimic community editors risks misalignment with editorial norms and increased moderation burden. Overall, high-level rule prompts are insufficient to fully replicate expert community judgments, underscoring the need for mixed-initiative systems and stakeholder-aligned evaluation when deploying LLM-based moderation tools.

Abstract

Large language models (LLMs) are trained on broad corpora and then used in communities with specialized norms. Is providing LLMs with community rules enough for models to follow these norms? We evaluate LLMs' capacity to detect (Task 1) and correct (Task 2) biased Wikipedia edits according to Wikipedia's Neutral Point of View (NPOV) policy. LLMs struggled with bias detection, achieving only 64% accuracy on a balanced dataset. Models exhibited contrasting biases (some under- and others over-predicted bias), suggesting distinct priors about neutrality. LLMs performed better at generation, removing 79% of words removed by Wikipedia editors. However, LLMs made additional changes beyond Wikipedia editors' simpler neutralizations, resulting in high-recall but low-precision editing. Interestingly, crowdworkers rated AI rewrites as more neutral (70%) and fluent (61%) than Wikipedia-editor rewrites. Qualitative analysis found LLMs sometimes applied NPOV more comprehensively than Wikipedia editors but often made extraneous non-NPOV-related changes (such as grammar). LLMs may apply rules in ways that resonate with the public but diverge from community experts. While potentially effective for generation, LLMs may reduce editor agency and increase moderation workload (e.g., verifying additions). Even when rules are easy to articulate, having LLMs apply them like community members may still be difficult.
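As an illustration of the "high-recall but low-precision" framing above, the sketch below scores an AI rewrite against an editor rewrite by comparing which words each one removed from the original text. It is a minimal sketch under stated assumptions, not the authors' evaluation code: the whitespace tokenizer, the lack of stopword filtering, and the names `removed_words` and `removal_precision_recall` are all hypothetical choices made for this example.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; the paper's exact
    # tokenization and stopword handling are not specified in this summary.
    return text.lower().split()


def removed_words(before, after):
    """Multiset of words that appear in `before` but were dropped in `after`."""
    return Counter(tokenize(before)) - Counter(tokenize(after))


def removal_precision_recall(original, editor_rewrite, ai_rewrite):
    """Compare AI removals with editor removals (editor treated as reference).

    recall    = share of editor-removed words that the AI also removed
    precision = share of AI-removed words that the editor also removed
    """
    editor_removed = removed_words(original, editor_rewrite)
    ai_removed = removed_words(original, ai_rewrite)
    overlap = sum((editor_removed & ai_removed).values())
    n_editor = sum(editor_removed.values())
    n_ai = sum(ai_removed.values())
    recall = overlap / n_editor if n_editor else 0.0
    precision = overlap / n_ai if n_ai else 0.0
    return precision, recall


# Made-up example sentences (not taken from the dataset).
original = "the brilliant senator bravely passed the controversial bill"
editor = "the senator passed the controversial bill"
ai = "the senator passed the bill"
precision, recall = removal_precision_recall(original, editor, ai)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case the AI removes everything the editor removed (recall 1.0) plus one extra word, so precision drops to 2/3. That is the pattern the abstract describes at scale: the 79% figure corresponds to recall under this framing, while the additional changes lower precision.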

Paper Structure (51 sections, 12 figures, 15 tables)

Introduction
Findings
Related Work
Wikipedia's Neutral Point of View (NPOV) Policy
Automated Approaches to NPOV and Wikipedia Moderation
Pre-Trained LLMs for Community Content Moderation
LLM Bias Detection
Dataset
Experiment Setup
Factor 1: Definitions Provided
Factor 2: Examples Provided
Experiment Results
LLM Self-Optimizations
Model-Level Analysis
Edit-Level Analysis
...and 36 more sections

Figures (12)

Figure 1: Graphical summaries of model performance and predictions.
Figure 2: Comparison of model performance using confusion matrices and binomial distribution tests.
Figure 3: Edit difficulty was bimodal and models were more accurate for biased edits.
Figure 3: Statistics of AI edit intensity with mean and SD in parentheses. Edit distance is the normalized edit distance between the NPOV-violating text and the neutralization. 'N Changes' is the number of words (excluding stopwords) that the edit changed (i.e., additions plus removals).
Figure 4: Words in an explanation with the most negative and most positive logit coefficients after a TF-IDF logistic regression predicting accuracy. Positive coefficients are associated with accuracy.
...and 7 more figures
