M2DS: Multilingual Dataset for Multi-document Summarisation
Kushan Hewapathirana, Nisansa de Silva, C. D. Athuraliya
TL;DR
The paper introduces M2DS, which the authors believe to be the first multilingual multi-document summarisation dataset, built from BBC articles in five languages (English, Japanese, Korean, Tamil, Sinhala) published between 2010 and 2023. It documents the dataset's development, from M3LS-based link extraction through quality validation to front-page expansion, which yields 180.0k documents in 51.5k clusters (an average of 3.5 documents per cluster), stored as a HuggingFace DatasetDict. The authors evaluate baseline and state-of-the-art models (PRIMERA, PEGASUS, LED, and Llama 2 7B) on language-specific splits. Llama 2 performs strongly in zero-shot settings, but language-specific MDS models still lead on English after fine-tuning. The work highlights the multilingual gap in MDS research, provides a high-quality multilingual benchmark, and points to language-adaptive modelling and cross-lingual transfer as future directions.
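To make the cluster structure concrete, the following is a minimal sketch in plain Python of how one multi-document example and the per-language grouping might be represented. The `Cluster` class, its field names, and the toy data are assumptions made for illustration; they are not the actual M2DS schema, and the sketch does not use the HuggingFace `datasets` library. The last line only checks that the reported totals are consistent with the stated average.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Cluster:
    """One multi-document summarisation example (hypothetical field names)."""
    language: str          # e.g. "english", "tamil"
    documents: list[str]   # related articles grouped into one cluster
    summary: str           # reference summary for the whole cluster


def group_by_language(clusters):
    """Bucket clusters per language, mirroring language-specific splits."""
    grouped = defaultdict(list)
    for cluster in clusters:
        grouped[cluster.language].append(cluster)
    return dict(grouped)


def average_documents_per_cluster(clusters):
    """Mean number of documents per cluster."""
    total_documents = sum(len(c.documents) for c in clusters)
    return total_documents / len(clusters)


# Toy data, invented for illustration only.
toy_clusters = [
    Cluster("english", ["doc a", "doc b", "doc c"], "summary 1"),
    Cluster("english", ["doc d", "doc e", "doc f", "doc g"], "summary 2"),
    Cluster("tamil", ["doc h", "doc i", "doc j"], "summary 3"),
]

by_language = group_by_language(toy_clusters)
toy_average = average_documents_per_cluster(toy_clusters)

# Reported totals: 180.0k documents in 51.5k clusters.
reported_average = 180_000 / 51_500
```

Rounded to one decimal place, `reported_average` matches the 3.5 documents per cluster quoted above. A dictionary keyed by language, as returned by `group_by_language`, is one plausible way to mirror language-specific splits, though the real dataset layout may differ.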
Abstract
In the rapidly evolving digital era, there is an increasing demand for concise information as individuals seek to distil key insights from various sources. Recent attention from researchers on Multi-document Summarisation (MDS) has resulted in diverse datasets covering customer reviews, academic papers, medical and legal documents, and news articles. However, the English-centric nature of these datasets has left a conspicuous void for multilingual datasets in today's globalised digital landscape, where linguistic diversity is celebrated. Media platforms such as the British Broadcasting Corporation (BBC) have disseminated news in 20+ languages for decades. With only 380 million people speaking English as their first language, accounting for less than 5% of the global population, the vast majority of people rely primarily on other languages. These facts underscore the need for inclusivity in MDS research, utilising resources from diverse languages. Recognising this gap, we present the Multilingual Dataset for Multi-document Summarisation (M2DS), which, to the best of our knowledge, is the first dataset of its kind. It includes document-summary pairs in five languages from BBC articles published during the 2010–2023 period. This paper introduces M2DS, emphasising its unique multilingual aspect, and includes baseline scores from state-of-the-art MDS models evaluated on our dataset.
