A Mixed-Language Multi-Document News Summarization Dataset and a Graphs-Based Extract-Generate Model

Shengxiang Gao, Fang Nan, Yongbing Zhang, Yuxin Huang, Kaiwen Tan, Zhengtao Yu

TL;DR

The authors construct MLMD-news, a mixed-language multi-document news summarization dataset that covers four languages and contains 10,992 pairs of source document clusters and target summaries. They also propose a graph-based extract-generate model and benchmark various methods on the dataset.

Abstract

Existing research on news summarization primarily focuses on single-language single-document (SLSD), single-language multi-document (SLMD), or cross-language single-document (CLSD) settings. In real-world scenarios, however, news about an international event often spans multiple documents in different languages, i.e., mixed-language multi-document (MLMD). Summarizing MLMD news is therefore of great significance, yet the lack of datasets for MLMD news summarization has constrained research in this area. To fill this gap, we construct a mixed-language multi-document news summarization dataset (MLMD-news), which contains four different languages and 10,992 pairs of source document clusters and target summaries. Additionally, we propose a graph-based extract-generate model, benchmark various methods on the MLMD-news dataset, and publicly release our dataset and code (https://github.com/Southnf9/MLMD-news), aiming to advance research on summarization in MLMD scenarios.

Paper Structure

This paper contains 22 sections, 8 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: The diagram of SLSD, SLMD, CLSD and MLMD. Each rounded rectangle represents a source document, while the pointed rectangle represents the target summary. "En", "De", "Fr", and "Es" indicate that the text is in English, German, French, and Spanish, respectively.
  • Figure 2: The diagram illustrates the construction process of the MLMD-news dataset. First, a round-trip translation (RTT) strategy is employed to translate each news document in the source document clusters of the Multi-News dataset into multiple languages and then back into the original language. This process allows the calculation of the ROUGE-1 score matrix for the document cluster. Based on this score matrix, a greedy selection strategy is used to assign a corresponding language to each news document. The original content of the news document is then replaced with the translated content in the assigned language, resulting in a source document cluster with mixed languages.
  • Figure 3: The number of news documents in different languages across the training, validation, and test sets.
  • Figure 4: The proportion of the number of languages involved in the source document clusters across the training, validation, and test sets.
  • Figure 5: The framework of the extract-generate model, which has three main components. In Graph Construction, mixed-language source document clusters are turned into both homogeneous and heterogeneous graphs. The Extractor selects key sentences from the source document cluster, and the Generator produces a summary from the sentences selected by the Extractor.
  • ...and 1 more figure
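
The construction procedure described in the Figure 2 caption (round-trip translation, a ROUGE-1 score matrix, then greedy language assignment) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The caption does not spell out the tokenization, the exact ROUGE-1 variant, or the rule used by the greedy selection, so the whitespace tokenizer, the F1 variant, the per-language cap `max_per_language`, and all function names below are assumptions made purely for illustration. The round-trip translations are taken as given inputs rather than produced by any particular translation system.

```python
from collections import Counter


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between two whitespace-tokenized strings (assumed variant)."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: each unigram counts at most as often as it appears in both.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


def score_matrix(originals, round_trips, languages):
    """Build a documents x languages matrix of round-trip ROUGE-1 scores.

    round_trips[lang][i] is document i translated into `lang` and back
    into the original language.
    """
    return [
        [rouge1_f1(doc, round_trips[lang][i]) for lang in languages]
        for i, doc in enumerate(originals)
    ]


def greedy_assign(scores, languages, max_per_language):
    """Repeatedly take the best remaining (document, language) cell.

    The per-language cap is an assumption of this sketch, used to force
    a mix of languages within a cluster.
    """
    cells = sorted(
        ((s, i, j) for i, row in enumerate(scores) for j, s in enumerate(row)),
        key=lambda cell: -cell[0],
    )
    assignment = {}
    used = Counter()
    for _, doc_idx, lang_idx in cells:
        if doc_idx in assignment or used[lang_idx] >= max_per_language:
            continue
        assignment[doc_idx] = languages[lang_idx]
        used[lang_idx] += 1
    return assignment


if __name__ == "__main__":
    originals = ["the cat sat", "a dog ran fast"]
    round_trips = {
        "de": ["the cat sat", "a dog walked"],
        "fr": ["the cat stood", "a dog ran fast"],
    }
    languages = ["de", "fr"]
    scores = score_matrix(originals, round_trips, languages)
    print(greedy_assign(scores, languages, max_per_language=1))
```

In this sketch, every document receives a language only if `max_per_language * len(languages)` is at least the number of documents in the cluster. A high round-trip score is used as a proxy for how well a document survives translation into a given language, which is the intuition the caption suggests.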