Large Language Models Require Curated Context for Reliable Political Fact-Checking -- Even with Reasoning and Web Search

Matthew R. DeVerna, Kai-Cheng Yang, Harry Yaojun Yan, Filippo Menczer

TL;DR

This study shows that standard large language models struggle with political fact-checking: reasoning adds minimal benefit and web search yields only moderate gains. A curated retrieval approach using PolitiFact article summaries dramatically boosts accuracy, improving macro F1 by 233% on average across model variants. The results imply that high-quality, claim-specific evidence matters more for reliable automated fact-checking than simply increasing model size or relying on uncurated web data. Scalable automated fact-checking should therefore prioritize robust curated retrieval pipelines and careful source selection to minimize bias and improve trust in AI-assisted verification.

Abstract

Large language models (LLMs) have raised hopes for automated end-to-end fact-checking, but prior studies report mixed results. As mainstream chatbots increasingly ship with reasoning capabilities and web search tools -- and millions of users already rely on them for verification -- rigorous evaluation is urgent. We evaluate 15 recent LLMs from OpenAI, Google, Meta, and DeepSeek on more than 6,000 claims fact-checked by PolitiFact, comparing standard models with reasoning- and web-search variants. Standard models perform poorly, reasoning offers minimal benefits, and web search provides only moderate gains, despite fact-checks being available on the web. In contrast, a curated RAG system using PolitiFact summaries improved macro F1 by 233% on average across model variants. These findings suggest that giving models access to curated high-quality context is a promising path for automated fact-checking.
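
The abstract reports results as macro F1 and as a percentage improvement over a baseline. The following is a minimal sketch of how those two quantities are conventionally computed, not the authors' evaluation code. The function names, the label strings, and the toy predictions are all hypothetical, and the percentage is assumed to be a relative change, (new - old) / old.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


def relative_improvement_pct(baseline, improved):
    """Relative change in percent (assumed definition, see lead-in)."""
    return (improved - baseline) / baseline * 100


# Toy labels for illustration only; these are not data from the study.
truth = ["true", "false", "half-true", "false"]
preds = ["true", "false", "false", "false"]
score = macro_f1(truth, preds)
```

Because macro F1 averages classes with equal weight, a model that never predicts a rare verdict category is penalized as heavily as one that misses a common category.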

Paper Structure

This paper contains 48 sections, 10 figures, and 8 tables.

Figures (10)

  • Figure 1: Data pipeline and study design. We collect PolitiFact claims, verdicts, article text, and metadata; generate evidence-focused summaries of fact-checking articles; and build a curated evidence database. We then evaluate 15 LLMs from four major providers with varying capabilities in two conditions, baseline (no retrieval) and with $k$ curated fact-checking article summaries ($k \in \{3,6,9\}$). Models predict claim veracity, and we compare their predictions to PolitiFact's verdicts.
  • Figure 2: Fact-checking performance of standard LLMs and retrieval performance of the Curated RAG system. (a) Macro F1 scores in zero-shot ($k=0$) and Curated RAG ($k>0$) conditions. Shapes and colors denote Curated RAG settings and model providers. Vertical bars show average F1 scores across Curated RAG settings ($k=3,6,9$); horizontal annotations show improvement over zero-shot. (b) Distribution of retrieval ranks for matching summaries across tested claims; the red square marks the median and the orange circle marks the mean. The y-axis displays random jitter for visualization clarity. (c) Top-$k$ retrieval accuracy for each setting.
  • Figure 3: Fact-checking performance of LLMs with (a) reasoning and (b) web search capabilities. Triangles show zero-shot performance of the corresponding standard models used as baselines: GPT-4o for o3-mini and o1; Gemini 2.0 Flash for Flash Thinking; DeepSeek-V3 for R1; and the non-search equivalent for search-enabled models (e.g., GPT-4o for GPT-4o Search). Circles denote zero-shot performance ($k=0$) of the reasoning models in (a) and the search-enhanced models in (b), while diamonds show their Curated RAG-enhanced performance at $k=6$. Horizontal annotations indicate performance differences: zero-shot reasoning/search variants compared with baselines (above symbols) and with the Curated RAG setting ($k=6$; below symbols).
  • Figure 4: Sources cited by search-enhanced GPT models. (a) Average number of citations by domain type. Error bars indicate standard deviation across $k$ values. (b) Top 10 sources cited by GPT-4o mini Search for $k=0$ and $k=6$. (c) Same as (b) for GPT-4o Search.
  • Figure 5: Joint distribution of NewsGuard reliability and political leaning scores for sources cited by search-enhanced GPT models. Marginal distributions are shown in the top and right panels for all citations (blue) and for citations excluding politifact.com (red). Black dashed lines separate NewsGuard group labels, and annotated percentages indicate the share of sources falling in each group; values in parentheses report the same percentages with PolitiFact excluded.
  • ...and 5 more figures
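
The Figure 2 caption refers to retrieval ranks of matching summaries and to top-$k$ retrieval accuracy. The sketch below shows one common way to compute top-$k$ accuracy from such ranks; it is an illustration under stated assumptions rather than the paper's implementation. It assumes ranks are 1-based, that a claim whose matching summary was not retrieved is recorded as `None`, and the rank values themselves are invented for the example.

```python
def top_k_accuracy(ranks, k):
    """Fraction of claims whose matching summary appears within the top k results.

    ranks: 1-based rank of the matching summary for each claim,
           or None when the matching summary was not retrieved (assumed encoding).
    """
    if not ranks:
        raise ValueError("ranks must not be empty")
    hits = sum(1 for r in ranks if r is not None and r <= k)
    return hits / len(ranks)


# Hypothetical ranks for eight claims; not data from the study.
example_ranks = [1, 2, 5, 7, None, 1, 10, 3]
accuracy_by_k = {k: top_k_accuracy(example_ranks, k) for k in (3, 6, 9)}
```

Top-$k$ accuracy can only stay flat or rise as $k$ grows, which is why comparing $k=3$, $6$, and $9$ shows how much additional context is needed before the matching summary is likely to be included.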