Evaluating open-source Large Language Models for automated fact-checking

Nicolo' Fontana, Francesco Corso, Enrico Zuccolotto, Francesco Pierri

TL;DR

This paper evaluates open-source large language models for automated fact-checking across three tasks: identifying the connection between claims and articles, judging claims based on related fact-checking articles, and performing autonomous fact-checking using external sources. Using the Fact-Check Insights dataset and a modular prompt-engineering framework (zero-shot, few-shot, and chain-of-thought prompts, plus ReAct for external knowledge), the study compares four open models against a RoBERTa baseline. Results show that LLMs can excel at linking claims to articles but underperform in veracity judgments on news claims, particularly true statements, and that external knowledge offers limited gains unless the information is efficiently structured. The findings emphasize the current limits of open-source LLMs for fully autonomous fact-checking and suggest avenues for improvement through tailored prompting, evidence-aware reasoning, and hybrid approaches combining LLMs with fine-tuned small language models (SLMs).

Abstract

The increasing prevalence of online misinformation has heightened the demand for automated fact-checking solutions. Large Language Models (LLMs) have emerged as potential tools for assisting in this task, but their effectiveness remains uncertain. This study evaluates the fact-checking capabilities of various open-source LLMs, focusing on their ability to assess claims with different levels of contextual information. We conduct three key experiments: (1) evaluating whether LLMs can identify the semantic relationship between a claim and a fact-checking article, (2) assessing models' accuracy in verifying claims when given a related fact-checking article, and (3) testing LLMs' fact-checking abilities when leveraging data from external knowledge sources such as Google and Wikipedia. Our results indicate that LLMs perform well in identifying claim-article connections and verifying fact-checked stories but struggle with confirming factual news, where they are outperformed by traditional fine-tuned models such as RoBERTa. Additionally, the introduction of external knowledge does not significantly enhance LLMs' performance, calling for more tailored approaches. Our findings highlight both the potential and limitations of LLMs in automated fact-checking, emphasizing the need for further refinements before they can reliably replace human fact-checkers.

Paper Structure

This paper contains 18 sections, 13 figures.

Figures (13)

  • Figure 1: Example of prompt structure with optional configurations. The black text represents the main prompt, shared across all tasks and configurations. Red placeholders indicate where the actual article and statement are inserted for each task. Blue components correspond to optional prompt modules, which are integrated into the main prompt (at the indicated position) based on the specific combination being tested. For instance, a prompt incorporating both Enrich and Chain-of-Thought will consist of the black main prompt, with the blue Enrich component placed between the Role and Task sub-prompts, and the blue Chain-of-Thought component appended at the end.
  • Figure 2: Task 1: Models' F1 scores computed for both classes. Fine-tuned RoBERTa is used as a reference baseline.
  • Figure 3: Task 1: For each model and each prompt category (Zero-Shot, Few-Shot, and Chain-of-Thought), the best prompt's F1 score is reported. The average F1 score for each model is also reported with 0.95 confidence intervals.
  • Figure 4: Task 1: Comparison between the F1 scores obtained by each model for each prompt variation with respect to both classes. The bisector is reported as a reference.
  • Figure 5: Task 1: Percentage of faults for each model. The median fault percentage for each model is: 0.241 (Llama3 8B), 0.003 (Llama3 70B), 0.393 (Mixtral 8x7B).
  • ...and 8 more figures
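
The modular prompt layout described in the Figure 1 caption can be illustrated with a minimal sketch. The sub-prompt texts, the function name `build_prompt`, and the order of the article and statement placeholders are all hypothetical; only the placement of the optional modules (Enrich between the Role and Task sub-prompts, Chain-of-Thought appended at the end) follows the caption.

```python
# Hypothetical sub-prompt texts: the paper's actual wording is not reproduced here.
ROLE = "You are a fact-checking assistant."
TASK = "Decide whether the statement is supported by the article."
ENRICH = "Pay attention to the context and the wording of the statement."
COT = "Let's think step by step."


def build_prompt(article: str, statement: str,
                 enrich: bool = False,
                 chain_of_thought: bool = False) -> str:
    """Assemble a prompt from the main parts plus optional modules.

    Assumed layout: Role, [Enrich], Task, article, statement, [Chain-of-Thought].
    """
    parts = [ROLE]
    if enrich:
        # Optional module placed between the Role and Task sub-prompts.
        parts.append(ENRICH)
    parts.append(TASK)
    # Placeholders filled with the actual article and statement.
    parts.append(f"Article: {article}")
    parts.append(f"Statement: {statement}")
    if chain_of_thought:
        # Optional module appended at the very end of the prompt.
        parts.append(COT)
    return "\n".join(parts)


if __name__ == "__main__":
    print(build_prompt("Some article text.", "Some claim.",
                       enrich=True, chain_of_thought=True))
```

Keeping each module as a separate string makes it straightforward to enumerate every combination of optional components when comparing prompt variants.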