Use of Retrieval-Augmented Large Language Model Agent for Long-Form COVID-19 Fact-Checking
Jingyi Huang, Yuyi Yang, Mengmeng Ji, Charles Alba, Sheng Zhang, Ruopeng An
TL;DR
This work tackles long-form COVID-19 misinformation by introducing SAFE, a multi-agent system that uses retrieval-augmented generation (LOTR-RAG) to extract claims and verify them against a large, curated corpus of peer-reviewed literature. The system chunks long texts, employs dual embedding models, and grounds each verification in retrieved evidence, achieving higher consistency and more actionable explanations than baseline LLMs. An enhanced variant with Self-RAG adds iterative self-evaluation, though it did not improve overall performance in this study. Results show statistically significant gains in consistency and explanation quality with SAFE, supporting scalable, evidence-grounded misinformation mitigation suitable for real-world public health use. The work also outlines limitations and directions for broader, multimodal, and cross-domain extensions.
Abstract
The COVID-19 infodemic calls for scalable fact-checking solutions that handle long-form misinformation with accuracy and reliability. This study presents SAFE (system for accurate fact extraction and evaluation), an agent system that combines large language models with retrieval-augmented generation (RAG) to improve automated fact-checking of long-form COVID-19 misinformation. SAFE includes two agents: one for claim extraction and another for claim verification using LOTR-RAG, which retrieves evidence from a 130,000-document COVID-19 research corpus. An enhanced variant, SAFE (LOTR-RAG + SRAG), incorporates Self-RAG to refine retrieval via query rewriting. We evaluated both systems on 50 fake news articles (2-17 pages) containing 246 annotated claims (M = 4.922, SD = 3.186), labeled as true (14.1%), partly true (14.4%), false (27.0%), partly false (2.2%), and misleading (21.0%) by public health professionals. Both SAFE systems significantly outperformed baseline LLMs on all metrics (p < 0.001). For consistency (0-1 scale), SAFE (LOTR-RAG) scored 0.629, exceeding both SAFE (+SRAG) (0.577) and the baseline (0.279). In subjective evaluations (0-4 Likert scale), SAFE (LOTR-RAG) also achieved the highest average ratings in usefulness (3.640), clearness (3.800), and authenticity (3.526). Adding SRAG slightly reduced overall performance, except for a minor gain in clearness. SAFE demonstrates robust improvements in long-form COVID-19 fact-checking by addressing LLM limitations in consistency and explainability. The core LOTR-RAG design proved more effective than its SRAG-augmented variant, offering a strong foundation for scalable misinformation mitigation.
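The two-agent flow described above (extract claims, then verify each claim against evidence retrieved with two embedding models) can be illustrated with a minimal sketch. This is not the authors' implementation: toy bag-of-words and character-trigram vectors stand in for the two embedding models, the interleave-and-deduplicate merge is one assumed way to combine two retrievers, and every name (`merged_retrieve`, `fact_check`, `extract_claims`, `judge`) is hypothetical. The LLM calls are passed in as plain callables so the sketch runs without any model.

```python
import math
from collections import Counter


def embed_words(text):
    """Toy stand-in for embedding model 1: bag-of-words counts."""
    return Counter(text.lower().split())


def embed_trigrams(text):
    """Toy stand-in for embedding model 2: character-trigram counts."""
    s = text.lower()
    return Counter(s[i:i + 3] for i in range(len(s) - 2))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, docs, embed, k):
    """Return indices of the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(range(len(docs)),
                    key=lambda i: cosine(q, embed(docs[i])),
                    reverse=True)
    return ranked[:k]


def merged_retrieve(query, docs, k=2):
    """Interleave the results of both retrievers, dropping duplicates."""
    result_lists = [retrieve(query, docs, embed, k)
                    for embed in (embed_words, embed_trigrams)]
    merged = []
    for rank in range(k):
        for results in result_lists:
            if rank < len(results) and results[rank] not in merged:
                merged.append(results[rank])
    return [docs[i] for i in merged]


def fact_check(article, corpus, extract_claims, judge, k=2):
    """Agent 1 extracts claims; agent 2 judges each against retrieved evidence."""
    report = []
    for claim in extract_claims(article):
        evidence = merged_retrieve(claim, corpus, k)
        report.append({"claim": claim,
                       "evidence": evidence,
                       "verdict": judge(claim, evidence)})
    return report


if __name__ == "__main__":
    corpus = [
        "Masks reduce transmission of respiratory viruses",
        "Vaccines lower the risk of severe COVID-19",
        "Handwashing removes pathogens from skin",
    ]
    # Placeholder agents: a sentence splitter and a judge that abstains.
    split_sentences = lambda text: [s.strip() for s in text.split(".") if s.strip()]
    abstain = lambda claim, evidence: "unverified"
    for row in fact_check("Vaccines lower risk of severe disease. Masks do nothing.",
                          corpus, split_sentences, abstain):
        print(row)
```

In a real system the placeholder callables would be replaced by LLM prompts, and a Self-RAG style variant would wrap `merged_retrieve` in a loop that rewrites the query when the retrieved evidence is judged insufficient.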
