Facts are Harder Than Opinions -- A Multilingual, Comparative Analysis of LLM-Based Fact-Checking Reliability
Lorraine Saju, Arnim Bleier, Jana Lasser, Claudia Wagner
TL;DR
This work examines the reliability of LLM-based fact-checking across languages and topical domains. It introduces FactSpan, a dynamically extensible multilingual dataset that extends the X-Fact/ClaimReview sources to 61,514 claims spanning 30 languages (2007–2024). Five prominent LLMs (GPT-4o, GPT-3.5 Turbo, LLaMA 3.1 variants, and Mixtral) are evaluated on a fixed true/false classification task. GPT-4o attains the highest accuracy (73.31%) but also a high refusal rate (about 43%), while the smaller open-source models show lower accuracy but fewer refusals. A key finding is that factual claims are harder to classify than opinions across all models, with performance strongly modulated by language-resource availability and claim features. Performance on claims published after the training cutoff remains robust for the closed models, which suggests that they rely on linguistic heuristics in addition to real-world knowledge. These results underscore the need for cautious deployment of LLM-based fact-checkers, highlight vulnerabilities to certain claim formulations, and motivate ongoing, data-driven evaluation as well as multi-modal, knowledge-augmented approaches for scalable, trustworthy verification.
Abstract
The proliferation of misinformation necessitates scalable, automated fact-checking solutions, yet current benchmarks often overlook multilingual and topical diversity. This paper introduces a novel, dynamically extensible dataset of 61,514 claims covering multiple languages and topics, extending existing datasets up to 2024. Through a comprehensive evaluation of five prominent Large Language Models (LLMs), including GPT-4o, GPT-3.5 Turbo, LLaMA 3.1, and Mixtral 8x7B, we identify significant performance gaps across languages and topics. Although GPT-4o achieves the highest overall accuracy, it declines to classify 43% of claims. Across all models, factual-sounding claims are misclassified more often than opinions, revealing a key vulnerability. These findings underscore the need for caution and highlight challenges in deploying LLM-based fact-checking systems at scale.
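
As a rough illustration of how accuracy and refusal figures of this kind might be computed, the following Python sketch scores a list of model outputs against gold labels. The label strings (`"true"`, `"false"`, `"refused"`), the function name `summarize`, and the decision to report accuracy both over answered claims and over all claims are assumptions made for illustration only; the text above does not specify how refusals enter the reported accuracy.

```python
def summarize(predictions, labels):
    """Compute refusal rate and accuracy for a true/false fact-checking run.

    predictions: list of "true", "false", or "refused" (hypothetical label set)
    labels:      list of gold labels, each "true" or "false"
    """
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")

    total = len(labels)
    refused = sum(1 for p in predictions if p == "refused")
    answered = total - refused
    # A refusal never counts as correct, since gold labels are only true/false.
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)

    return {
        # Share of claims the model declined to classify.
        "refusal_rate": refused / total if total else 0.0,
        # Accuracy restricted to claims the model actually classified (assumption).
        "accuracy_answered": correct / answered if answered else 0.0,
        # Accuracy when refusals are counted as errors (alternative convention).
        "accuracy_overall": correct / total if total else 0.0,
    }


if __name__ == "__main__":
    preds = ["true", "false", "refused", "true"]
    gold = ["true", "true", "false", "false"]
    print(summarize(preds, gold))
```

The two accuracy conventions can diverge sharply when a model refuses often, so a reported accuracy is only interpretable alongside the refusal rate and a statement of which convention was used.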
