AI Fact-Checking in the Wild: A Field Evaluation of LLM-Written Community Notes on X

Haiwen Li, Michiel A. Bakker

Abstract

Large language models show promising capabilities for contextual fact-checking on social media: they can verify contested claims through deep research, synthesize evidence from multiple sources, and draft explanations at scale. However, prior work evaluates LLM fact-checking only in controlled settings using benchmarks or crowdworker judgments, leaving open how these systems perform in authentic platform environments. We present the first field evaluation of LLM-based fact-checking deployed on a live social media platform, testing performance directly through X Community Notes' AI writer feature over a three-month period. Our LLM writer is a multi-step pipeline that handles multimodal content (text, images, and videos), conducts web and platform-native search, and writes contextual notes. It was deployed to write 1,614 notes on 1,597 tweets and compared against 1,332 human-written notes on the same tweets using 108,169 ratings from 42,521 raters. Direct comparison of note-level platform outcomes is complicated by differences in submission timing and rating exposure between LLM and human notes; we therefore pursue two complementary strategies: a rating-level analysis modeling individual rater evaluations, and a note-level analysis that equalizes rater exposure across note types. Rating-level analysis shows that LLM notes receive more positive ratings than human notes across raters with different political viewpoints, suggesting the potential for LLM-written notes to achieve cross-partisan consensus. Note-level analysis confirms this advantage: among raters who evaluated all notes on the same post, LLM notes achieve significantly higher helpfulness scores. Our findings demonstrate that LLMs can contribute high-quality, broadly helpful fact-checking at scale, while highlighting that real-world evaluation requires careful attention to platform dynamics absent from controlled settings.
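The rating-level strategy described in the abstract models each individual rater evaluation rather than aggregate note outcomes. As a minimal, hypothetical sketch of such an analysis (the abstract does not specify the exact model, and every column name below, namely helpful, is_llm, rater_ideology, and note_id, is an illustrative assumption), one could fit a logistic regression of the binary helpful rating on an LLM-note indicator interacted with rater ideology, clustering standard errors by note:

```python
# Hedged sketch of a rating-level analysis: logistic regression of
# individual rating outcomes on an LLM-note indicator interacted with
# rater ideology. All column names are hypothetical, not from the paper.
import pandas as pd
import statsmodels.formula.api as smf

ratings = pd.read_csv("ratings.csv")  # one row per (rater, note) rating

# helpful: 1 if the rater marked the note helpful, else 0
# is_llm: 1 for LLM-written notes, 0 for human-written notes
# C(rater_ideology) codes raters as left / neutral / right
model = smf.logit("helpful ~ is_llm * C(rater_ideology)", data=ratings).fit(
    cov_type="cluster", cov_kwds={"groups": ratings["note_id"]}
)

# The is_llm main effect is the LLM-vs-human rating advantage for the
# reference ideology group; interaction terms show how it shifts by group.
print(model.summary())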



Figures (4)

  • Figure 1: Mean % helpful and % unhelpful ratings per note for LLM and human notes, stratified by rater ideology group (left, neutral, right). Error bars show 95% confidence intervals across notes.
  • Figure 2: LLM vs. human note rating advantage (AI main-effect coefficient with 95% CI) by tweet modality (text-only, image, video).
  • Figure 3: LLM vs. human note rating advantage (AI main-effect coefficient with 95% CI) by tweet topic category.
  • Figure A1: Distribution of rater characteristics for the full rater population vs. complete raters who evaluated all notes on a given tweet (a sketch of this complete-rater filtering follows the list). Left: coreRaterIntercept captures baseline helpfulness leniency. Right: coreRaterFactor1 captures political leaning (negative = left-leaning, positive = right-leaning). The close overlap indicates that complete raters are not systematically different from the overall rater population.
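The note-level strategy equalizes rater exposure by scoring each note only on ratings from the complete raters of Figure A1, i.e., raters who evaluated every note on a given tweet. Below is a minimal sketch under the same hypothetical schema as above, with assumed columns tweet_id, rater_id, note_id, is_llm, and helpful (none of these names come from the paper):

```python
# Hedged sketch of the note-level comparison restricted to "complete
# raters": raters who evaluated every note on a given tweet. All column
# names are illustrative assumptions.
import pandas as pd

ratings = pd.read_csv("ratings.csv")  # one row per (rater, note) rating

# Number of distinct notes on each tweet.
n_notes = (
    ratings.groupby("tweet_id")["note_id"].nunique().rename("n_notes").reset_index()
)

# Number of distinct notes each rater evaluated on each tweet.
coverage = (
    ratings.groupby(["tweet_id", "rater_id"])["note_id"]
    .nunique()
    .rename("n_rated")
    .reset_index()
    .merge(n_notes, on="tweet_id")
)

# Complete raters: those who rated every note on the tweet.
complete = coverage[coverage["n_rated"] == coverage["n_notes"]]

# Keep only ratings from complete raters, then score each note as the
# share of those ratings marked helpful.
subset = ratings.merge(complete[["tweet_id", "rater_id"]], on=["tweet_id", "rater_id"])
note_scores = subset.groupby(["note_id", "is_llm"])["helpful"].mean()

# Compare mean per-note helpfulness for LLM vs. human notes.
print(note_scores.groupby("is_llm").mean())
```

Restricting to complete raters removes the exposure asymmetry flagged in the abstract (LLM and human notes on the same tweet may otherwise be rated by different rater pools), at the cost of a smaller sample; the distributional overlap shown in Figure A1 is what justifies treating this subset as representative of the overall rater population.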