Evaluating open-source Large Language Models for automated fact-checking
Nicolò Fontana, Francesco Corso, Enrico Zuccolotto, Francesco Pierri
TL;DR
This paper evaluates open-source large language models (LLMs) for automated fact-checking across three tasks: identifying the connection between a claim and a fact-checking article, judging a claim's veracity given a related fact-checking article, and fact-checking autonomously using external sources. Using the Fact-Check Insights dataset and a prompt-engineering framework (zero-shot, few-shot, and chain-of-thought prompts, plus ReAct for external knowledge), the study compares four open models against a RoBERTa baseline. The LLMs perform well at linking claims to articles but underperform the baseline when judging the veracity of news claims, particularly true statements, and external knowledge yields limited gains unless the information is efficiently structured. The findings highlight the current limits of open-source LLMs for fully autonomous fact-checking and suggest avenues for improvement: tailored prompting, evidence-aware reasoning, and hybrid approaches that combine LLMs with fine-tuned small language models (SLMs).
Abstract
The increasing prevalence of online misinformation has heightened the demand for automated fact-checking solutions. Large Language Models (LLMs) have emerged as potential tools for assisting in this task, but their effectiveness remains uncertain. This study evaluates the fact-checking capabilities of various open-source LLMs, focusing on their ability to assess claims with different levels of contextual information. We conduct three key experiments: (1) evaluating whether LLMs can identify the semantic relationship between a claim and a fact-checking article, (2) assessing models' accuracy in verifying claims when given a related fact-checking article, and (3) testing LLMs' fact-checking abilities when leveraging data from external knowledge sources such as Google and Wikipedia. Our results indicate that LLMs perform well in identifying claim-article connections and verifying fact-checked stories but struggle with confirming factual news, where they are outperformed by traditional fine-tuned models such as RoBERTa. Additionally, the introduction of external knowledge does not significantly enhance LLMs' performance, calling for more tailored approaches. Our findings highlight both the potential and limitations of LLMs in automated fact-checking, emphasizing the need for further refinements before they can reliably replace human fact-checkers.
