RoLargeSum: A Large Dialect-Aware Romanian News Dataset for Summary, Headline, and Keyword Generation

Andrei-Marius Avram, Mircea Timpuriu, Andreea Iuga, Vlad-Cristian Matei, Iulian-Marius Tăiatu, Tudor Găină, Dumitru-Clementin Cercel, Florin Pop, Mihaela-Claudia Cercel

TL;DR

RoLargeSum addresses the scarcity of Romanian summarization data by introducing a large-scale dataset of Romanian and Moldavian news with summaries, headlines, dialect labels, and metadata. The authors benchmark multiple BART variants, Unlimiformer for long-context processing, and dialect-adversarial training to produce robust, dialect-aware baselines, complemented by evaluations of Romanian and multilingual LLMs and a human assessment of the best model. Key contributions include the dataset itself, dialect-aware subtasks, strong baselines, and insights from both automatic and human evaluations, along with a discussion of limitations and ethical considerations. The work advances Romanian NLP by providing a resource for accurate summary, headline, and keyword generation, enabling cross-dialect research and integration into benchmarks like LiRo, with practical implications for multilingual and low-resource settings.

Abstract

Supervised automatic summarization methods require sufficiently large corpora of documents paired with their summaries. As with many tasks in natural language processing, most of the datasets available for summarization are in English, posing challenges for developing summarization models in other languages. Thus, in this work, we introduce RoLargeSum, a novel large-scale summarization dataset for the Romanian language, crawled from various publicly available news websites in Romania and the Republic of Moldova and thoroughly cleaned to ensure a high-quality standard. RoLargeSum contains more than 615K news articles, together with their summaries, headlines, keywords, dialect, and other metadata that we found on the targeted websites. We further evaluated the performance of several BART variants and open-source large language models on RoLargeSum for benchmarking purposes. We manually evaluated the results of the best-performing system to gain insight into the potential pitfalls of this dataset and directions for future development.

Paper Structure

This paper contains 42 sections, 7 equations, 2 figures, 10 tables.

Figures (2)

  • Figure 1: Boxplots depicting the numbers of words for the documents, summaries, headlines, and keywords in the RoLargeSum dataset.
  • Figure 2: The proposed baseline architecture used to generate summaries, headlines, and keywords in RoLargeSum. We split the input tokens into chunks and add neighbouring context to each chunk. The encoder then generates an embedding for each chunk, and these embeddings are used to build a kNN index for the cross-attention mechanism of the decoder. Finally, we adversarially train a feed-forward network to detect the dialect of the input text, which pushes the encoder to produce dialect-independent embeddings.
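
The chunking step described in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function name chunk_with_context and the parameters chunk_size and context are hypothetical, and the window sizes used here are arbitrary. It only shows the general idea of splitting a long token sequence into chunks that each carry some tokens from their neighbours.

```python
from typing import List, Sequence


def chunk_with_context(
    tokens: Sequence[int], chunk_size: int, context: int
) -> List[List[int]]:
    """Split `tokens` into chunks of `chunk_size` core tokens, each extended
    with up to `context` tokens from the neighbouring chunks on both sides.

    All names and sizes are illustrative assumptions, not values from the paper.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if context < 0:
        raise ValueError("context must be non-negative")

    chunks: List[List[int]] = []
    for start in range(0, len(tokens), chunk_size):
        # Extend the window to the left and right, clipped to the sequence.
        lo = max(0, start - context)
        hi = min(len(tokens), start + chunk_size + context)
        chunks.append(list(tokens[lo:hi]))
    return chunks


example_chunks = chunk_with_context(list(range(10)), chunk_size=4, context=2)
```

Each chunk produced this way would then be encoded separately, so the encoder never sees more tokens than its context window allows, while the overlap gives every chunk some view of its surroundings.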