Face the Facts! Evaluating RAG-based Pipelines for Professional Fact-Checking
Daniel Russo, Stefano Menini, Jacopo Staiano, Marco Guerini
TL;DR
This work benchmarks Retrieval-Augmented Generation (RAG) pipelines for generating professional fact-checking verdicts across neutral, social-media-platform (SMP), and emotional claim styles, using Gold and Silver knowledge bases (KBs). It systematically evaluates retrieval strategies (sparse, dense, hybrid, and LLM-based) and generation setups (zero-shot, one-shot, and fine-tuning) across multiple LLMs, revealing that LLM-based retrievers offer superior retrieval performance while larger generation models improve verdict faithfulness; however, heterogeneous KBs remain a challenge. The study also investigates the impact of preprocessing through fact extraction and of chunk-based versus article-based retrieval, showing that preprocessing and KB design significantly influence performance. Human evaluation indicates that zero-shot and one-shot approaches often maximize informativeness, while fine-tuning enhances emotional alignment, suggesting practical guidance for deploying RAG verdict systems in real-world fact-checking workflows.
Abstract
Natural Language Processing and Generation systems have recently shown the potential to complement and streamline the costly and time-consuming job of professional fact-checkers. In this work, we lift several constraints of current state-of-the-art pipelines for automated fact-checking based on the Retrieval-Augmented Generation (RAG) paradigm. Our goal is to benchmark, following professional fact-checking practices, RAG-based methods for the generation of verdicts (i.e., short texts discussing the veracity of a claim), evaluating them on stylistically complex claims and heterogeneous, yet reliable, knowledge bases. Our findings show a complex landscape. For example, LLM-based retrievers outperform other retrieval techniques, though they still struggle with heterogeneous knowledge bases. Larger models excel in verdict faithfulness, while smaller models provide better context adherence. Human evaluations favour zero-shot and one-shot approaches for informativeness, and fine-tuned models for emotional alignment.
