Facts are Harder Than Opinions -- A Multilingual, Comparative Analysis of LLM-Based Fact-Checking Reliability

Lorraine Saju, Arnim Bleier, Jana Lasser, Claudia Wagner

TL;DR

This work addresses the reliability of LLM-based fact-checking across languages and topical domains by creating FactSpan, a dynamic multilingual dataset extending the X-Fact/ClaimReview sources to 61,514 claims spanning 30 languages (2007–2024). It evaluates five prominent LLMs (GPT-4o, GPT-3.5 Turbo, LLaMA 3.1 variants, Mixtral) on a fixed true/false task, revealing that GPT-4o attains the highest accuracy (73.31%) but also a high refusal rate (approximately 43%), while smaller open-source models show lower accuracy but fewer refusals. A key finding is that factual claims are harder to classify than opinions across all models, with performance strongly modulated by language-resource availability and claim features; post-cutoff generalization is robust for closed models, suggesting reliance on linguistic heuristics in addition to real-world knowledge. These results underscore the need for cautious deployment of LLM-based fact-checkers, highlight vulnerabilities to certain claim formulations, and motivate ongoing, data-driven evaluation and multi-modal, knowledge-augmented approaches for scalable, trustworthy verification.

Abstract

The proliferation of misinformation necessitates scalable, automated fact-checking solutions. Yet current benchmarks often overlook multilingual and topical diversity. This paper introduces a novel, dynamically extensible dataset that includes 61,514 claims in multiple languages and topics, extending existing datasets up to 2024. Through a comprehensive evaluation of five prominent Large Language Models (LLMs), including GPT-4o, GPT-3.5 Turbo, LLaMA 3.1, and Mixtral 8x7B, we identify significant performance gaps between different languages and topics. While GPT-4o achieves the highest overall accuracy, it declines to classify 43% of claims. Across all models, factual-sounding claims are misclassified more often than opinions, revealing a key vulnerability. These findings underscore the need for caution and highlight challenges in deploying LLM-based fact-checking systems at scale.

Paper Structure

This paper contains 34 sections, 3 figures, 6 tables.

Figures (3)

  • Figure 1: Performance of five LLMs on factual vs. opinion claims: (left) 'No Verdict' percentage and (right) fact-checking accuracy, with a red line at 0.5 indicating chance-level performance in the accuracy subplot. Points are colored by claim type, allowing a direct comparison of model behavior across these two categories.
  • Figure 2: Fact-checking accuracy of five LLMs (Mixtral 8x7B, LLaMA 3.1 (70B and 8B), GPT-4o, and GPT-3.5 Turbo) across 61,514 claims in multiple languages. Each subplot represents one model, with the x-axis showing different languages and the y-axis representing the fact-checking accuracy. For each language, we display the model’s performance on claims created before the model’s training cut-off (black dot) and after the cut-off (white cross). The size of the circle around each marker corresponds to the number of claims available for that language. The red line at 0.5 indicates chance-level performance, as fact-checking was framed as a binary classification task. Circle color denotes language resource class, based on the taxonomy of Joshi et al. (2020), with Class 5 being high-resource languages and Class 0 low-resource. GPT-3.5 Turbo and GPT-4o exhibit strong generalization beyond the official training cut-off dates, often outperforming their own pre-cutoff performance, especially in high-resource languages. The LLaMA 3.1 models also show slight improvements post-cutoff, though the gains are more modest. Mixtral's cut-off date is not publicly known, so pre/post distinctions are not shown for that model. The figure also illustrates large performance variability across languages and models, emphasizing the role of both language resources and model scale.
  • Figure 3: No-verdict percentages of five LLMs (Mixtral 8x7B, LLaMA 3.1 (70B and 8B), GPT-4o, and GPT-3.5 Turbo) across 61,514 claims in multiple languages. Each subplot represents one model, with the x-axis showing different languages and the y-axis representing the no-verdict percentage. For each language, we display the model's no-verdict percentage on claims created before the model’s training cut-off (black dot) and after the cut-off (white cross). The size of the circle around each marker corresponds to the volume of claims in that language. The figure highlights the variation in fact-checking coverage across languages and between models, showing that the best-performing model, GPT-4o, was also the most reluctant to give a verdict.
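
The two quantities plotted in Figures 2 and 3 (accuracy and no-verdict percentage, each split at a model's training cut-off) can be illustrated with a minimal sketch. This is not the authors' code. The record layout, the function name `summarize`, the toy data, and the cut-off date are all hypothetical. The sketch also assumes that accuracy is computed only over claims for which the model returned a verdict, which the text above does not state.

```python
from collections import defaultdict
from datetime import date


def summarize(records, cutoff):
    """Compute accuracy and no-verdict percentage before and after a cut-off.

    records: iterable of (claim_date, gold_label, prediction) tuples, where
             prediction is True/False, or None when the model gave no verdict.
    cutoff:  hypothetical training cut-off date of the model.
    """
    stats = defaultdict(lambda: {"total": 0, "no_verdict": 0, "correct": 0})
    for claim_date, gold, pred in records:
        split = "pre" if claim_date <= cutoff else "post"
        s = stats[split]
        s["total"] += 1
        if pred is None:
            s["no_verdict"] += 1
        elif pred == gold:
            s["correct"] += 1

    summary = {}
    for split, s in stats.items():
        answered = s["total"] - s["no_verdict"]
        summary[split] = {
            "no_verdict_pct": 100.0 * s["no_verdict"] / s["total"],
            # Assumption: accuracy is taken over answered claims only.
            "accuracy": s["correct"] / answered if answered else None,
        }
    return summary


# Toy data, invented purely for illustration.
toy_records = [
    (date(2020, 1, 1), True, True),
    (date(2021, 1, 1), False, True),
    (date(2022, 1, 1), True, None),
    (date(2022, 6, 1), False, False),
    (date(2024, 1, 1), False, False),
    (date(2024, 2, 1), True, None),
]
result = summarize(toy_records, cutoff=date(2023, 1, 1))
print(result)
```

Under this assumption a model can raise its accuracy by declining hard claims, so the two metrics are best read together, as the paired figures do.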