EUvsDisinfo: A Dataset for Multilingual Detection of Pro-Kremlin Disinformation in News Articles
João A. Leite, Olesya Razuvayevskaya, Kalina Bontcheva, Carolina Scarton
TL;DR
EUvsDisinfo addresses multilingual, article-level disinformation detection by constructing the largest available dataset of pro-Kremlin disinformation articles and trustworthy counterparts. The authors build the resource from EUvsDisinfo debunks, extracting 18,249 articles in 42 languages spanning 8.5 years, with 508 annotated topics. They analyze topic prevalence and temporal dynamics, and benchmark multilingual classifiers: Multinomial Naive Bayes (MNB), SVM, mBERT, and XLM-RoBERTa. Transformer models perform robustly, with mBERT achieving an average macro-F1 of 0.83 across languages and notable strengths in specific languages, which highlights the dataset’s value for cross-language detection and narrative analysis. The work provides a public resource and tooling for reproducibility and future research, and points to directions in evidence-aware fact-checking and responsible data sharing under the FAIR principles and an Apache 2.0 license.
Abstract
This work introduces EUvsDisinfo, a multilingual dataset of disinformation articles originating from pro-Kremlin outlets, along with trustworthy articles from credible / less biased sources. It is sourced directly from the debunk articles written by the experts leading the EUvsDisinfo project. Our dataset is the largest resource to date in terms of both the overall number of articles and the number of distinct languages. It also provides the largest topical and temporal coverage. Using this dataset, we investigate the dissemination of pro-Kremlin disinformation across different languages, uncovering language-specific patterns in which disinformation topics are targeted. We further analyse the evolution of the topic distribution over an eight-year period, noting a significant surge in disinformation content before the full-scale invasion of Ukraine in 2022. Lastly, we demonstrate the dataset's applicability in training models to effectively distinguish between disinformation and trustworthy content in multilingual settings.
