Towards Automated Fact-Checking of Real-World Claims: Exploring Task Formulation and Assessment with LLMs

Premtim Sahitaj, Iffat Maab, Junichi Yamagishi, Jawan Kolanowski, Sebastian Möller, Vera Schmitt

TL;DR

This work tackles automated fact-checking of real-world claims by evaluating retrieval-augmented LLMs in a few-shot setup across multiple labeling schemes. Using 17,856 PolitiFact claims, the authors show that larger Llama-3 models plus evidence retrieval yield the strongest performance gains in both verdict accuracy and justification quality, while finer label granularity reduces classification performance. The study also benchmarks an upper-bound classifier trained with ModernBERT-large to contextualize achievable performance and confirms the value of evidence grounding. Overall, the results demonstrate the potential of retrieval-augmented AFC with LLMs for scalable, explainable fact-checking, while highlighting challenges in label granularity and the risk of hallucinations that warrant further user-centered evaluation and interactive validation.

Abstract

Fact-checking is necessary to address the increasing volume of misinformation. Traditional fact-checking relies on manual analysis to verify claims, but it is slow and resource-intensive. This study establishes baseline comparisons for Automated Fact-Checking (AFC) using Large Language Models (LLMs) across multiple labeling schemes (binary, three-class, five-class) and extends traditional claim verification by incorporating analysis, verdict classification, and explanation in a structured setup to provide comprehensive justifications for real-world claims. We evaluate Llama-3 models of varying sizes (3B, 8B, 70B) on 17,856 claims collected from PolitiFact (2007-2024) using evidence retrieved via restricted web searches. We utilize TIGERScore as a reference-free evaluation metric to score the justifications. Our results show that larger LLMs consistently outperform smaller LLMs in classification accuracy and justification quality without fine-tuning. We find that smaller LLMs in a one-shot scenario provide comparable task performance to fine-tuned Small Language Models (SLMs) with large context sizes, while larger LLMs consistently surpass them. Evidence integration improves performance across all models, with larger LLMs benefiting most. Distinguishing between nuanced labels remains challenging, emphasizing the need for further exploration of labeling schemes and alignment with evidence. Our findings demonstrate the potential of retrieval-augmented AFC with LLMs.

Paper Structure

This paper contains 15 sections, 2 figures, 7 tables.

Figures (2)

  • Figure 1: Example data point of a statement made by the New York Times Editorial Board and evaluated by PolitiFact as False.
  • Figure 2: Analysis of the New York Times editorial case involving Sarah Palin.