Beyond Translation: LLM-Based Data Generation for Multilingual Fact-Checking
Yi-Ling Chung, Aurora Cobo, Pablo Serna
TL;DR
The paper tackles multilingual misinformation by introducing MultiSynFact, a dataset of 2.2M claim-source pairs generated with an LLM-driven pipeline that uses Wikipedia as external knowledge and applies rigorous validation. The approach combines knowledge extraction, multilingual claim generation, and MNLI-based filtering to ensure high-quality, language-diverse data. Empirical results show consistent improvements in monolingual, multilingual, and cross-lingual fact-checking when training with MultiSynFact, underscoring the value of synthetic multilingual data for generalization. The authors provide an open-source toolkit to enable broad adoption and future enhancements, advancing practical multilingual fact-checking, particularly for low-resource languages.
Abstract
Robust automatic fact-checking systems have the potential to combat online misinformation at scale. However, most existing research focuses on English. In this paper, we introduce MultiSynFact, the first large-scale multilingual fact-checking dataset containing 2.2M claim-source pairs, designed to support Spanish, German, and English as well as low-resource languages. Our dataset generation pipeline leverages Large Language Models (LLMs), integrating external knowledge from Wikipedia and incorporating rigorous claim validation steps to ensure data quality. We evaluate the effectiveness of MultiSynFact across multiple models and experimental settings. Additionally, we open-source a user-friendly framework to facilitate further research in multilingual fact-checking and dataset generation.
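
To make the generate-then-validate idea concrete, the following is a minimal sketch of how a source passage could be turned into labeled claims and then filtered with an NLI model. It is not the authors' implementation: the names `ClaimSourcePair`, `build_pairs`, `toy_generate`, and `toy_nli` are hypothetical, the label set is assumed to be the three standard MNLI classes, and the two toy functions stand in for a real LLM and a real NLI classifier so that the example runs without any external dependency.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Assumption: intended claim labels reuse the three standard MNLI classes.
LABELS = ("entailment", "contradiction", "neutral")


@dataclass(frozen=True)
class ClaimSourcePair:
    claim: str
    source: str
    label: str
    language: str


# Hypothetical signatures for the two pluggable model calls.
Generator = Callable[[str, str, str], str]  # (source, label, language) -> claim
NLIModel = Callable[[str, str], str]        # (premise, hypothesis) -> predicted label


def build_pairs(
    sources: Iterable[str],
    languages: Iterable[str],
    generate: Generator,
    nli: NLIModel,
) -> List[ClaimSourcePair]:
    """Generate one claim per (source, language, label) and keep only those
    whose NLI prediction agrees with the label the generator was asked for."""
    languages = list(languages)  # allow re-iteration for every source
    kept: List[ClaimSourcePair] = []
    for source in sources:
        for language in languages:
            for label in LABELS:
                claim = generate(source, label, language)
                # Validation step: discard claims the NLI model disagrees with.
                if nli(source, claim) == label:
                    kept.append(ClaimSourcePair(claim, source, label, language))
    return kept


def toy_generate(source: str, label: str, language: str) -> str:
    """Stand-in for an LLM call. It deliberately fails on 'neutral' by
    repeating the source, so the filter has something to reject."""
    if label == "contradiction":
        return "NOT " + source
    return source


def toy_nli(premise: str, hypothesis: str) -> str:
    """Stand-in for an NLI classifier based on exact string matching."""
    if hypothesis == premise:
        return "entailment"
    if hypothesis == "NOT " + premise:
        return "contradiction"
    return "neutral"


if __name__ == "__main__":
    pairs = build_pairs(
        ["Madrid is the capital of Spain."], ["es", "de"], toy_generate, toy_nli
    )
    for pair in pairs:
        print(pair.language, pair.label, pair.claim)
```

In this toy run, six candidate claims are generated (one source, two languages, three labels) and the two faulty "neutral" candidates are dropped by the filter. In a real setting, `generate` and `nli` would wrap actual model calls, and the agreement check could be replaced by a confidence threshold.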
