Surprising Efficacy of Fine-Tuned Transformers for Fact-Checking over Larger Language Models
Vinay Setty
TL;DR
The paper tackles end-to-end multilingual fact-checking across 90+ languages by proposing a three-stage pipeline: check-worthy claim detection, evidence search, and veracity prediction. It demonstrates that fine-tuned Transformer models, notably XLM-RoBERTa-Large, outperform large language models (LLMs) on claim detection and veracity prediction, while LLMs excel at generating questions for evidence retrieval. For numerical claims, FinQA-RoBERTa-Large provides superior numerical reasoning, though LLMs remain strong in decomposition tasks. Privacy considerations motivate self-hosted deployments. The results support a hybrid approach that uses small, fine-tuned Transformers for core reasoning and self-hosted LLMs for generative components, with broader validation and scaling identified as future work.
Abstract
In this paper, we explore the challenges associated with establishing an end-to-end fact-checking pipeline in a real-world context, covering over 90 languages. Our real-world experimental benchmarks demonstrate that fine-tuning Transformer models specifically for fact-checking tasks, such as claim detection and veracity prediction, provides superior performance over large language models (LLMs) like GPT-4, GPT-3.5-Turbo, and Mistral-7b. However, we illustrate that LLMs excel in generative tasks such as question decomposition for evidence retrieval. Through extensive evaluation, we show the efficacy of fine-tuned models for fact-checking in a multilingual setting and on complex claims that include numerical quantities.
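The division of labor described above (fine-tuned classifiers for claim detection and veracity prediction, an LLM for question decomposition ahead of evidence retrieval) can be sketched roughly as follows. This is a minimal illustration rather than the paper's implementation: the names `Verdict`, `fact_check`, `detect`, `decompose`, `search`, and `predict` are all hypothetical, and each stage is passed in as a plain callable so that any model or search backend could stand behind it.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Verdict:
    """Result of running one claim through the sketched pipeline (hypothetical type)."""
    claim: str
    check_worthy: bool
    evidence: List[str]
    label: Optional[str]  # None when the claim was not judged check-worthy


def fact_check(
    claim: str,
    detect: Callable[[str], bool],              # assumed: fine-tuned claim-detection classifier
    decompose: Callable[[str], List[str]],      # assumed: LLM that turns a claim into questions
    search: Callable[[str], List[str]],         # assumed: evidence retrieval for one question
    predict: Callable[[str, List[str]], str],   # assumed: fine-tuned veracity classifier
) -> Verdict:
    # Stage 1: skip claims that are not worth checking.
    if not detect(claim):
        return Verdict(claim, False, [], None)

    # Stage 2: generate questions, then gather evidence for each one in order.
    evidence: List[str] = []
    for question in decompose(claim):
        evidence.extend(search(question))

    # Stage 3: predict veracity from the claim and the collected evidence.
    return Verdict(claim, True, evidence, predict(claim, evidence))
```

Keeping the stages as injected callables reflects the hybrid setup the abstract argues for: the classifier stages and the generative stage can be swapped independently, for example replacing a hosted LLM with a self-hosted one, without changing the control flow.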
