Surprising Efficacy of Fine-Tuned Transformers for Fact-Checking over Larger Language Models

Vinay Setty

TL;DR

The paper tackles end-to-end multilingual fact-checking across 90+ languages by proposing a three-stage pipeline: check-worthy claim detection, evidence search, and veracity prediction. It demonstrates that fine-tuned Transformer models, notably XLM-RoBERTa-Large, outperform large language models on claim detection and veracity tasks, while LLMs excel at generating questions for evidence retrieval. For numerical claims, FinQA-RoBERTa-Large provides superior numerical reasoning, though LLMs remain strong in decomposition tasks; privacy considerations motivate self-hosted deployments. The results support a hybrid approach that leverages small, fine-tuned transformers for core reasoning and self-hosted LLMs for generative components, with broader validation and scaling identified as future work.
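As a rough illustration of how the three stages named above chain together, here is a minimal Python sketch. It is not the paper's implementation: the `run_pipeline` function, the `Verdict` type, the label strings, and all three `toy_*` stand-ins are hypothetical placeholders for the fine-tuned and generative models the paper describes.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Verdict:
    claim: str
    evidence: List[str]
    label: str


def run_pipeline(
    sentences: List[str],
    is_checkworthy: Callable[[str], bool],
    search_evidence: Callable[[str], List[str]],
    predict_veracity: Callable[[str, List[str]], str],
) -> List[Verdict]:
    """Chain the three stages: detect claims, find evidence, predict veracity."""
    verdicts = []
    for sentence in sentences:
        # Stage 1: skip sentences that are not check-worthy claims.
        if not is_checkworthy(sentence):
            continue
        # Stage 2: retrieve evidence for the claim.
        evidence = search_evidence(sentence)
        # Stage 3: label the claim given the retrieved evidence.
        label = predict_veracity(sentence, evidence)
        verdicts.append(Verdict(sentence, evidence, label))
    return verdicts


# Hypothetical toy stand-ins so the sketch runs without any model.
TOY_CORPUS = ["Official statistics report unemployment at 3% in 2023."]


def toy_is_checkworthy(sentence: str) -> bool:
    # Assumption for illustration only: treat sentences with digits as claims.
    return any(ch.isdigit() for ch in sentence)


def toy_search_evidence(claim: str) -> List[str]:
    # Assumption for illustration only: naive word-overlap retrieval.
    claim_words = set(claim.lower().split())
    return [
        doc for doc in TOY_CORPUS
        if len(claim_words & set(doc.lower().split())) >= 2
    ]


def toy_predict_veracity(claim: str, evidence: List[str]) -> str:
    # Assumption for illustration only: any retrieved document counts as support.
    return "supported" if evidence else "not enough evidence"


if __name__ == "__main__":
    sentences = ["Hello there.", "Unemployment fell to 3% in 2023."]
    for verdict in run_pipeline(
        sentences, toy_is_checkworthy, toy_search_evidence, toy_predict_veracity
    ):
        print(verdict)
```

Passing the stages in as callables mirrors the hybrid design the summary argues for: a fine-tuned classifier could fill the first and third slots while an LLM-backed question generator sits behind the evidence search, without changing the surrounding control flow.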

Abstract

In this paper, we explore the challenges associated with establishing an end-to-end fact-checking pipeline in a real-world context, covering over 90 languages. Our real-world experimental benchmarks demonstrate that fine-tuning Transformer models specifically for fact-checking tasks, such as claim detection and veracity prediction, provides superior performance over large language models (LLMs) like GPT-4, GPT-3.5-Turbo, and Mistral-7b. However, we illustrate that LLMs excel in generative tasks such as question decomposition for evidence retrieval. Through extensive evaluation, we show the efficacy of fine-tuned models for fact-checking in a multilingual setting and on complex claims that include numerical quantities.

Paper Structure (16 sections, 3 figures, 4 tables)

Figures (3)

  • Figure 1: System Architecture of the Fact-Checking Pipeline at Factiverse
  • Figure 2: Evaluation of claim detection for 114 languages using the Factiverse model, GPT-3.5-Turbo, GPT-4, and Mistral-7b.
  • Figure 3: Evaluation of veracity prediction for 46 languages.