The State and Fate of Summarization Datasets: A Survey

Noam Dahan, Gabriel Stanovsky

TL;DR

This paper investigates the lack of standardization in automatic summarization datasets and presents an ontology to unify how they are reported. It surveys 133 datasets across 104 languages, annotating them along seven axes, including language and modality, domain, summary shape, supervision, and availability. Key findings are that abstractive versus extractive is a spectrum rather than a binary, that the field relies heavily on the news domain, with quality concerns for low-resource languages, and that copyright concerns are driving a shift toward reconstruction-based distribution. To facilitate standardization, the authors provide an interactive web interface for dataset discovery and a data-card template that records sample information, annotation methods, and data quality metrics such as the compression rate and the novel n-grams ratio, defined as Novel n-grams ratio = (#New n-grams)/(#n-grams). These contributions aim to streamline future research, improve cross-language comparability, and encourage broader use of high-quality multilingual summarization resources.
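The two data quality metrics named above can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several choices here are assumptions: whitespace tokenization with lowercasing, counting n-gram occurrences in the summary (rather than distinct n-grams), and defining the compression rate as source length divided by summary length, since the text above does not define it. The function names are hypothetical.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def novel_ngram_ratio(source, summary, n=1):
    """(#New n-grams) / (#n-grams), computed over the summary.

    Assumption: a summary n-gram is "new" if it never occurs in the source;
    tokens are lowercased and split on whitespace.
    """
    source_ngrams = set(ngrams(source.lower().split(), n))
    summary_ngrams = ngrams(summary.lower().split(), n)
    if not summary_ngrams:
        return 0.0  # summary shorter than n tokens: nothing to count
    new = sum(1 for gram in summary_ngrams if gram not in source_ngrams)
    return new / len(summary_ngrams)


def compression_rate(source, summary):
    """Assumed definition: source token count / summary token count."""
    summary_len = len(summary.split())
    if summary_len == 0:
        raise ValueError("summary must contain at least one token")
    return len(source.split()) / summary_len


source = "the cat sat on the mat"
summary = "a cat sat quietly"
print(novel_ngram_ratio(source, summary, n=1))  # 2 of 4 unigrams are new
print(novel_ngram_ratio(source, summary, n=2))  # 2 of 3 bigrams are new
print(compression_rate(source, summary))        # 6 source tokens / 4 summary tokens
```

A ratio near 0 indicates a largely extractive summary, while a ratio near 1 indicates a highly abstractive one, which is consistent with treating abstractiveness as a spectrum rather than a binary label.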

Abstract

Automatic summarization has consistently attracted attention due to its versatility and wide application in various downstream tasks. Despite its popularity, we find that annotation efforts have largely been disjointed and have lacked common terminology. Consequently, it is challenging to discover existing resources or identify coherent research directions. To address this, we survey a large body of work spanning 133 datasets in over 100 languages, creating a novel ontology covering sample properties, collection methods, and distribution. With this ontology we make key observations, including the lack of accessible high-quality datasets for low-resource languages, and the field's over-reliance on the news domain and on automatically collected distant supervision. Finally, we make available a web interface that allows users to interact with and explore our ontology and dataset collection, as well as a template for a summarization data card, which can be used to streamline future research into a more coherent body of work.

Paper Structure

This paper contains 144 sections, 1 equation, 7 figures, 2 tables.

Figures (7)

  • Figure 1: The number of summarization datasets being published is continuously increasing, yet the field lacks standardization and common terminology.
  • Figure 2: Our ontology for summarization datasets, accompanied by our annotations. The percentages in the language column indicate the proportion of datasets that support each language, where we count each multilingual dataset as multiple monolingual datasets. The arrows showcase the decision pathways for selecting specific datasets.
  • Figure 3: Every dot on the graph corresponds to a dataset in our collection. Each dot's position indicates the percentage of unique uni-grams in the summaries that are not found in the source texts. Only a minority of the datasets report an evaluation of the properties of the summaries.
  • Figure 4: The rise in the number of languages supported by summarization resources was achieved mainly through multilingual datasets. Most languages in our survey are only supported through them.
  • Figure 5: The distribution of domains for English datasets versus non-English datasets. While English domains are diverse, our survey shows that most datasets for other languages consist of news articles.
  • ...and 2 more figures