Beyond Translation: LLM-Based Data Generation for Multilingual Fact-Checking

Yi-Ling Chung, Aurora Cobo, Pablo Serna

TL;DR

The paper tackles multilingual misinformation by introducing MultiSynFact, a 2.2M claim-source dataset generated with an LLM-driven pipeline that uses Wikipedia as external knowledge and rigorous validation. The approach combines knowledge extraction, multilingual claim generation, and MNLI-based filtering to ensure high-quality, language-diverse data. Empirical results show consistent improvements in monolingual, multilingual, and cross-lingual fact-checking when training with MultiSynFact, underscoring the value of synthetic multilingual data for generalization. The authors provide an open-source toolkit to enable broad adoption and future enhancements, advancing practical multilingual fact-checking, particularly for low-resource languages.

Abstract

Robust automatic fact-checking systems have the potential to combat online misinformation at scale. However, most existing research primarily focuses on English. In this paper, we introduce MultiSynFact, the first large-scale multilingual fact-checking dataset containing 2.2M claim-source pairs designed to support Spanish, German, English, and other low-resource languages. Our dataset generation pipeline leverages Large Language Models (LLMs), integrating external knowledge from Wikipedia and incorporating rigorous claim validation steps to ensure data quality. We evaluate the effectiveness of MultiSynFact across multiple models and experimental settings. Additionally, we open-source a user-friendly framework to facilitate further research in multilingual fact-checking and dataset generation.
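The three stages named above (knowledge extraction, multilingual claim generation, and NLI-based filtering) can be illustrated with a minimal sketch. This is not the authors' implementation: the label set, the function names (`extract_sources`, `build_pairs`), and the callable signatures are all hypothetical, and the LLM and the MNLI model are replaced by toy stand-ins so the example runs without any external dependency.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class ClaimSourcePair:
    claim: str
    source: str
    language: str
    label: str


# Hypothetical interfaces: a real pipeline would wrap an LLM and an NLI model.
ClaimGenerator = Callable[[str, str, str], str]  # (source, language, label) -> claim
NliClassifier = Callable[[str, str], str]        # (premise, hypothesis) -> NLI class

# Assumed mapping from the intended claim label to the NLI class that confirms it.
EXPECTED_NLI = {
    "supports": "entailment",
    "refutes": "contradiction",
    "not_enough_info": "neutral",
}


def extract_sources(article: str, min_words: int = 5) -> List[str]:
    """Naive knowledge extraction: keep sentences with at least min_words words."""
    sentences = [s.strip() for s in article.split(".")]
    return [s + "." for s in sentences if len(s.split()) >= min_words]


def build_pairs(
    article: str,
    languages: Iterable[str],
    generate: ClaimGenerator,
    classify: NliClassifier,
) -> List[ClaimSourcePair]:
    """Generate one claim per (source, language, label) and keep it only if
    the classifier's verdict matches the label the claim was meant to have."""
    languages = list(languages)
    kept = []
    for source in extract_sources(article):
        for language in languages:
            for label, expected in EXPECTED_NLI.items():
                claim = generate(source, language, label)
                if classify(source, claim) == expected:
                    kept.append(ClaimSourcePair(claim, source, language, label))
    return kept


def toy_generate(source: str, language: str, label: str) -> str:
    """Stand-in for an LLM. It deliberately fails for German 'refutes' claims
    by echoing the source, so the filter has something to reject."""
    if label == "supports" or (language == "de" and label == "refutes"):
        return source
    if label == "refutes":
        return "NOT " + source
    return "unrelated statement"


def toy_classify(premise: str, hypothesis: str) -> str:
    """Stand-in for an NLI model, based on exact string matching only."""
    if hypothesis == premise:
        return "entailment"
    if hypothesis == "NOT " + premise:
        return "contradiction"
    return "neutral"


article = "Madrid is the capital city of Spain. Short one."
pairs = build_pairs(article, ["es", "de"], toy_generate, toy_classify)
for pair in pairs:
    print(pair.language, pair.label, "|", pair.claim)
```

In this toy run the second sentence is discarded at extraction because it is too short, and the German "refutes" claim is discarded at filtering because the classifier reads it as entailment rather than contradiction. Passing the generator and classifier as callables keeps the filtering logic independent of whichever models are plugged in.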

Paper Structure

This paper contains 32 sections, 1 figure, and 18 tables.

Figures (1)

  • Figure 1: Automated pipeline of multilingual claims generation.