Beyond the Crowd: LLM-Augmented Community Notes for Governing Health Misinformation

Jiaying Wu, Zihang Fu, Haonan Wang, Fanxiao Li, Min-Yen Kan

TL;DR

The paper addresses delays in crowd-sourced health misinformation governance by introducing CrowdNotes+, an LLM-augmented framework that supports note creation through two modes, evidence-grounded augmentation and utility-guided automation, coupled with a hierarchical three-step evaluation (relevance, correctness, helpfulness). It introduces HealthNotes, a 1,268-sample health-note benchmark paired with a domain-tuned judge, HealthJudge, and demonstrates across 15 LLMs that hierarchical evaluation reduces factual errors and that LLM-generated notes can achieve higher factual accuracy and contextual balance than human-written notes when properly grounded. Key contributions include the two generation modes, the hierarchical evaluation pipeline, and empirical evidence that LLM-augmented notes improve the timeliness and quality of misinformation governance in health contexts. The work supports a hybrid human–AI governance model for scalable, interpretable, and timely crowd-based moderation on social platforms, with planned extensions to other languages, other domains, and end-to-end deployment pipelines.

Abstract

Community Notes, the crowd-sourced misinformation governance system on X (formerly Twitter), enables users to flag misleading posts, attach contextual notes, and vote on their helpfulness. However, our analysis of 30.8K health-related notes reveals significant latency, with a median delay of 17.6 hours before the first note receives a helpfulness status. To improve responsiveness during real-world misinformation surges, we propose CrowdNotes+, a unified framework that leverages large language models (LLMs) to augment Community Notes for faster and more reliable health misinformation governance. CrowdNotes+ integrates two complementary modes: (1) evidence-grounded note augmentation and (2) utility-guided note automation, along with a hierarchical three-step evaluation that progressively assesses relevance, correctness, and helpfulness. We instantiate the framework through HealthNotes, a benchmark of 1.2K helpfulness-annotated health notes paired with a fine-tuned helpfulness judge. Experiments on fifteen LLMs reveal an overlooked loophole in current helpfulness evaluation, where stylistic fluency is mistaken for factual accuracy, and demonstrate that our hierarchical evaluation and LLM-augmented generation jointly enhance factual precision and evidence utility. These results point toward a hybrid human-AI governance model that improves both the rigor and timeliness of crowd-sourced fact-checking.
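
The hierarchical three-step evaluation described above can be pictured as a gated pipeline in which a note is scored for helpfulness only after it has passed the relevance and correctness checks. The following is a minimal sketch under stated assumptions, not the authors' implementation: the names `Verdict`, `evaluate_note`, and `toy_judges` are hypothetical, and each judge is a placeholder callable standing in for whatever model or rater performs that step.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

# A judge takes (post, note, evidence) and returns pass/fail.
# Hypothetical interface: the paper does not specify this signature.
Judge = Callable[[str, str, str], bool]

# Order taken from the abstract: relevance, then correctness, then helpfulness.
STAGES = ("relevance", "correctness", "helpfulness")


@dataclass
class Verdict:
    helpful: bool
    failed_stage: Optional[str]  # None when every stage passes


def evaluate_note(post: str, note: str, evidence: str,
                  judges: Dict[str, Judge]) -> Verdict:
    """Run the stages in order and stop at the first failure."""
    for stage in STAGES:
        if not judges[stage](post, note, evidence):
            return Verdict(helpful=False, failed_stage=stage)
    return Verdict(helpful=True, failed_stage=None)


# Toy stand-in judges for illustration only (assumptions, not real classifiers).
toy_judges: Dict[str, Judge] = {
    "relevance": lambda post, note, evidence: "vaccine" in evidence.lower(),
    "correctness": lambda post, note, evidence: True,
    "helpfulness": lambda post, note, evidence: len(note) > 20,
}

verdict = evaluate_note(
    post="Claim about a vaccine.",
    note="A fluent but off-topic note about diets.",
    evidence="Study on sleep",
    judges=toy_judges,
)
print(verdict)
```

Because the gate stops at the first failed stage, a fluent note that cites irrelevant evidence is rejected before its helpfulness is ever scored, which is the loophole the abstract says a helpfulness-only evaluation leaves open.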

Paper Structure

This paper contains 29 sections, 1 equation, 9 figures, 7 tables.

Figures (9)

  • Figure 1: Overview of Community Notes on X for crowd-sourced misinformation governance. Platform users engage in three stages: (1) flagging potentially misleading posts, (2) contributing clarifications or contextual notes, and (3) rating the helpfulness of these notes. As votes accumulate, each note attains one of three statuses ("Not Enough Ratings", "Currently Rated Helpful", or "Currently Rated Not Helpful"), and only Helpful notes are publicly surfaced alongside the original misleading post to inform readers.
  • Figure 2: Spikes in flagged health misinformation posts correspond to major real-world health events (details in Section \ref{sec:misinfo-dynamics}), including outbreak alerts, vaccine updates, and policy debates, highlighting the event-driven nature of misinformation on X.
  • Figure 3: Overview of the proposed CrowdNotes+ framework for LLM-augmented Community Notes. The upper timeline depicts the crowd-sourced Community Notes workflow on X. The lower panels illustrate two LLM-augmented modes in CrowdNotes+: (1) evidence-grounded note augmentation, where LLMs write notes using human-provided evidence, and (2) utility-guided note automation, where LLMs autonomously retrieve evidence from the Web to generate notes more efficiently. Together, these modes enable scalable, timely, and reliable support for community-driven misinformation governance.
  • Figure 4: Example of a human-written note mislabeled as Helpful by human voters but correctly flagged as Not Helpful by CrowdNotes+ for citing irrelevant evidence.
  • Figure 5: Error distribution of 89 human-written notes that misrepresented evidence, grouped by three main causes.
  • ...and 4 more figures