SCRum-9: Multilingual Stance Classification over Rumours on Social Media

Yue Li, Jake Vasilakes, Zhixue Zhao, Carolina Scarton

TL;DR

SCRum-9 introduces the largest multilingual rumour stance classification benchmark to date, spanning 9 languages with 7,516 tweet–reply pairs linked to 2,156 fact-checked rumours and annotated with confidence and second-choice labels to capture annotator uncertainty. The work provides a comprehensive data collection and annotation protocol, including topic-based filtering and a two-round adjudication process, and benchmarks both LLM-based in-context learning (ICL) and multilingual MLM fine-tuning, the latter augmented with multilingual synthetic data generated by LLMs. Key findings show substantial cross-language variation in ICL performance, with translation and few-shot demonstrations often helping non-English cases, and show that synthetic multilingual data can bring fine-tuned MLMs to competitive or superior performance at lower compute cost. SCRum-9 offers new avenues for multilingual rumour analysis, uncertainty studies, and downstream tasks such as claim verification, and is publicly released to spur further research.

Abstract

We introduce SCRum-9, the largest multilingual Stance Classification dataset for Rumour analysis in 9 languages, containing 7,516 tweets from X. SCRum-9 goes beyond existing stance classification datasets by covering more languages, linking examples to more fact-checked claims (2.1k), and including confidence-related annotations from multiple annotators to account for intra- and inter-annotator variability. Annotations were made by at least two native speakers per language, totalling more than 405 hours of annotation and 8,150 dollars in compensation. Further, SCRum-9 is used to benchmark five large language models (LLMs) and two multilingual masked language models (MLMs) in In-Context Learning (ICL) and fine-tuning setups. This paper also explores the use of multilingual synthetic data for rumour stance classification, showing that even LLMs with weak ICL performance can produce valuable synthetic data for fine-tuning small MLMs, enabling them to achieve higher performance than zero-shot ICL in LLMs. Finally, we examine the relationship between model predictions and human uncertainty on ambiguous cases, finding that model predictions often match the second-choice labels assigned by annotators rather than diverging entirely from human judgments. SCRum-9 is publicly released to the research community, with the potential to foster further research on multilingual analysis of misleading narratives on social media.
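
The second-choice analysis mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the record layout (`annotations`, `first`, `second`), the stance label names, and the helper functions are hypothetical and chosen only for illustration. The gold label is taken by majority vote over first-choice labels, and for each prediction that misses the gold label, the sketch checks whether it equals some annotator's second choice.

```python
from collections import Counter


def majority_first_choice(annotations):
    """Return the most frequent first-choice label (ties: first seen wins)."""
    counts = Counter(a["first"] for a in annotations)
    return counts.most_common(1)[0][0]


def second_choice_match_rate(examples, predictions):
    """Among predictions that miss the majority label, return the share
    that equals at least one annotator's second-choice label."""
    misses = 0
    matched = 0
    for example, pred in zip(examples, predictions):
        gold = majority_first_choice(example["annotations"])
        if pred == gold:
            continue  # correct prediction, not part of this analysis
        misses += 1
        # Collect second choices, skipping annotators who gave none.
        seconds = {a["second"] for a in example["annotations"] if a.get("second")}
        if pred in seconds:
            matched += 1
    return matched / misses if misses else 0.0


# Toy data with assumed label names; not taken from SCRum-9.
examples = [
    {"annotations": [{"first": "support", "second": "comment"},
                     {"first": "support", "second": None}]},
    {"annotations": [{"first": "deny", "second": "query"},
                     {"first": "deny", "second": "comment"}]},
    {"annotations": [{"first": "comment", "second": None},
                     {"first": "comment", "second": None}]},
]
predictions = ["comment", "support", "comment"]

rate = second_choice_match_rate(examples, predictions)
print(f"Share of misses matching a second choice: {rate:.2f}")
```

In this toy input, two predictions miss the majority label and one of them coincides with an annotator's second choice, so the reported share is 0.50. A real analysis would also need a policy for ties in the majority vote, which this sketch resolves arbitrarily by first occurrence.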

Paper Structure

This paper contains 45 sections, 10 figures, and 9 tables.

Figures (10)

  • Figure 1: An illustration of the multilinguality and annotation design of SCRum-9.
  • Figure 2: Statistics for SCRum-9 with labels determined with majority-voting over the first-choice stance labels.
  • Figure 3: Mean and standard deviation of zero-shot baseline ICL performance ($wF2$) on SCRum-9 with different LLMs. LLMs in the bottom-right, with high mean and low standard deviation, exhibit good and relatively consistent performance across languages.
  • Figure 4: Comparison between ICL performances ($wF2$) across the eight non-English languages with Qwen.
  • Figure 5: Performance ($wF2$) comparison across languages between (1) XLM-R fine-tuned with English; (2) XLM-R fine-tuned with translated multilingual data; (3) Baseline zero-shot ICL performances of Gemma and Llama; and (4) Best ICL performances of Gemma and Llama.
  • ...and 5 more figures