Face the Facts! Evaluating RAG-based Pipelines for Professional Fact-Checking

Daniel Russo, Stefano Menini, Jacopo Staiano, Marco Guerini

TL;DR

This work benchmarks Retrieval-Augmented Generation (RAG) pipelines for generating professional fact-checking verdicts across neutral, SMP, and emotional claim styles, using Gold and Silver knowledge bases (KBs). It systematically evaluates retrieval strategies (sparse, dense, hybrid, and LLM-based) and generation setups (zero-shot, one-shot, and fine-tuning) across multiple LLMs. LLM-based retrievers achieve the best retrieval performance and larger generation models improve verdict faithfulness; however, heterogeneous KBs remain a challenge. The study also investigates the impact of claim preprocessing through fact extraction and of chunk-based versus article-based retrieval, showing that preprocessing and KB design significantly influence performance. Human evaluation indicates that zero-shot and one-shot approaches are often preferred for informativeness, while fine-tuning improves emotional alignment, which offers practical guidance for deploying RAG verdict systems in real-world fact-checking workflows.

Abstract

Natural Language Processing and Generation systems have recently shown the potential to complement and streamline the costly and time-consuming job of professional fact-checkers. In this work, we lift several constraints of current state-of-the-art pipelines for automated fact-checking based on the Retrieval-Augmented Generation (RAG) paradigm. Our goal is to benchmark, following professional fact-checking practices, RAG-based methods for the generation of verdicts - i.e., short texts discussing the veracity of a claim - evaluating them on stylistically complex claims and heterogeneous, yet reliable, knowledge bases. Our findings show a complex landscape, where, for example, LLM-based retrievers outperform other retrieval techniques, though they still struggle with heterogeneous knowledge bases; larger models excel in verdict faithfulness, while smaller models provide better context adherence, with human evaluations favouring zero-shot and one-shot approaches for informativeness, and fine-tuned models for emotional alignment.

Paper Structure

This paper contains 34 sections, 4 figures, 12 tables.

Figures (4)

  • Figure 1: Visual representation of our RAG-based experimental design (the steps for retrieval and generation are indicated by the red and blue lines, respectively). We explored various configurations to tackle increasingly realistic scenarios across different claim styles (neutral, SMP, emotional) and Knowledge Bases (Gold vs. Silver), as well as varying computational demands through multiple retriever architectures (sparse, dense, hybrid, and LLM-based) and five distinct LLMs in different generation setups (zero-shot, one-shot, fine-tuning).
  • Figure 2: Human evaluation results: Percentages of preference for the four generation setups across the datasets.
  • Figure 3: Retrieval results for each type of retriever (sparse, dense, LLM, hybrid) across Gold_KB$_{art}$ and Gold_KB$_{chunks}$ are presented for all claim styles, both with (SMP Facts, Emotional Facts) and without (neutral, SMP, emotional) claim pre-processing. The metrics reported include hit_rate and MRR for retrieval over Gold_KB$_{art}$, and hit_rate and MAP for Gold_KB$_{chunks}$, for increasing values of retrieved documents/chunks ($k=1,\dots,10$).
  • Figure 4: Complete results for the human evaluation. Each matrix refers to the results obtained for each verdict evaluation aspect. The matrices report how many times, in percentage, the human annotators preferred each of the four generation setups (gold, zero-shot, one-shot, fine-tuning).
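
The Figure 3 caption reports hit_rate, MRR, and MAP at increasing cut-offs $k$. As a rough illustration of what these retrieval metrics measure, the sketch below computes them from ranked lists of document or chunk identifiers. It is a minimal sketch using the common textbook definitions, not the authors' evaluation code: the function names, the toy identifiers, and the choice to normalise average precision by min(|relevant|, k) are all assumptions, since the exact formulation is not given in this overview.

```python
from typing import Callable, List, Sequence, Set, Tuple

# One "run" per claim: (ranked retrieved ids, set of relevant ids).
Run = Tuple[Sequence[str], Set[str]]


def hit_rate_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """1.0 if at least one relevant item is in the top-k results, else 0.0."""
    return 1.0 if any(doc in relevant for doc in ranked[:k]) else 0.0


def reciprocal_rank_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """1/rank of the first relevant item within the top-k, or 0.0 if none."""
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def average_precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Mean of precision values at each relevant hit within the top-k.

    Assumption: normalised by min(|relevant|, k); other variants exist.
    """
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranked[:k], start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    denom = min(len(relevant), k)
    return precision_sum / denom if denom else 0.0


def mean_over_claims(metric: Callable[[Sequence[str], Set[str], int], float],
                     runs: List[Run], k: int) -> float:
    """Average a per-claim metric over all claims (gives hit rate, MRR, or MAP)."""
    if not runs:
        raise ValueError("runs must contain at least one claim")
    return sum(metric(ranked, relevant, k) for ranked, relevant in runs) / len(runs)


# Toy data (hypothetical ids): two claims with three retrieved items each.
runs: List[Run] = [
    (["a", "b", "c"], {"b"}),
    (["d", "e", "f"], {"d", "f"}),
]

hit_at_1 = mean_over_claims(hit_rate_at_k, runs, k=1)
hit_at_3 = mean_over_claims(hit_rate_at_k, runs, k=3)
mrr_at_3 = mean_over_claims(reciprocal_rank_at_k, runs, k=3)
map_at_3 = mean_over_claims(average_precision_at_k, runs, k=3)
print(hit_at_1, hit_at_3, mrr_at_3, round(map_at_3, 4))
```

With article-level retrieval there is typically a single relevant article per claim, so MRR is a natural choice; with chunk-level retrieval several chunks can be relevant, which is where MAP becomes more informative. This matches the split of metrics in the caption, though the reasoning here is a general observation rather than a statement from the paper.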