Beyond the Crowd: LLM-Augmented Community Notes for Governing Health Misinformation
Jiaying Wu, Zihang Fu, Haonan Wang, Fanxiao Li, Min-Yen Kan
TL;DR
The paper addresses delays in crowd-sourced health misinformation governance by introducing CrowdNotes+, an LLM-augmented framework that supports note creation through two generation modes, evidence-grounded augmentation and utility-guided automation, coupled with a hierarchical three-step evaluation (relevance, correctness, helpfulness). It instantiates the framework through HealthNotes, a benchmark of 1,268 health notes, evaluated with a domain-tuned judge, HealthJudge. Experiments across 15 LLMs show that hierarchical evaluation reduces factual errors and that, when properly grounded, LLM-generated notes can achieve higher factual accuracy and contextual balance than human-written notes. Key contributions are the two generation modes, the evaluation pipeline, and empirical evidence that LLM-augmented notes improve the timeliness and quality of misinformation governance in health contexts. The work supports a hybrid human-AI governance model for scalable, interpretable, and timely crowd-based moderation on social platforms, with planned extensions to additional languages, domains, and end-to-end deployment pipelines.
Abstract
Community Notes, the crowd-sourced misinformation governance system on X (formerly Twitter), enables users to flag misleading posts, attach contextual notes, and vote on their helpfulness. However, our analysis of 30.8K health-related notes reveals significant latency, with a median delay of 17.6 hours before the first note receives a helpfulness status. To improve responsiveness during real-world misinformation surges, we propose CrowdNotes+, a unified framework that leverages large language models (LLMs) to augment Community Notes for faster and more reliable health misinformation governance. CrowdNotes+ integrates two complementary modes: (1) evidence-grounded note augmentation and (2) utility-guided note automation, along with a hierarchical three-step evaluation that progressively assesses relevance, correctness, and helpfulness. We instantiate the framework through HealthNotes, a benchmark of 1.2K helpfulness-annotated health notes paired with a fine-tuned helpfulness judge. Experiments on fifteen LLMs reveal an overlooked loophole in current helpfulness evaluation, where stylistic fluency is mistaken for factual accuracy, and demonstrate that our hierarchical evaluation and LLM-augmented generation jointly enhance factual precision and evidence utility. These results point toward a hybrid human-AI governance model that improves both the rigor and timeliness of crowd-sourced fact-checking.
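The hierarchical three-step evaluation can be pictured as a gated pipeline in which a note is scored for helpfulness only after it has passed the relevance and correctness checks. The sketch below is illustrative only and is not the authors' implementation: the names `Verdict`, `hierarchical_evaluate`, and the three check functions are hypothetical, and the early-stop behavior is an assumption about what "progressively assesses" means.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical signature: a check takes (post, note) and returns pass/fail.
Check = Callable[[str, str], bool]


@dataclass
class Verdict:
    passed: bool
    failed_stage: Optional[str]  # None when every stage passes


def hierarchical_evaluate(
    post: str,
    note: str,
    is_relevant: Check,
    is_correct: Check,
    is_helpful: Check,
) -> Verdict:
    """Run relevance, correctness, and helpfulness checks in order.

    Assumption: a failure at any stage stops the evaluation, so a fluent
    but factually wrong note never reaches the helpfulness judge.
    """
    stages = [
        ("relevance", is_relevant),
        ("correctness", is_correct),
        ("helpfulness", is_helpful),
    ]
    for name, check in stages:
        if not check(post, note):
            return Verdict(passed=False, failed_stage=name)
    return Verdict(passed=True, failed_stage=None)
```

In practice each check could be backed by an LLM or a fine-tuned judge. The point of the ordering is that a helpfulness-only evaluation may reward stylistic fluency, whereas the gate rejects a note at the correctness stage regardless of how helpful it sounds.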
