Newsroom: A Dataset of 1.3 Million Summaries with Diverse Extractive Strategies

Max Grusky, Mor Naaman, Yoav Artzi

TL;DR

Newsroom tackles the data bottleneck in single-document summarization by releasing a 1.3 million article–summary dataset sourced from 38 news publishers over nearly two decades. It introduces robust measures to quantify extractiveness and abstraction (Coverage, Density, Compression) and analyzes how summaries vary across outlets, topics, and time. The paper systematically evaluates baseline extractive, abstractive, and mixed methods, complemented by a human evaluation protocol, demonstrating Newsroom's challenging diversity and potential for improving cross-domain performance. The dataset and evaluation framework enable data-intensive learning for real-world newsroom summarization and are publicly accessible for reproducibility.

Abstract

We present NEWSROOM, a summarization dataset of 1.3 million articles and summaries written by authors and editors in newsrooms of 38 major news publications. Extracted from search and social media metadata between 1998 and 2017, these high-quality summaries demonstrate high diversity of summarization styles. In particular, the summaries combine abstractive and extractive strategies, borrowing words and phrases from articles at varying rates. We analyze the extraction strategies used in NEWSROOM summaries against other datasets to quantify the diversity and difficulty of our new data, and train existing methods on the data to evaluate its utility and challenges.

Paper Structure

This paper contains 29 sections, 3 equations, 4 figures, and 7 tables.

Figures (4)

  • Figure 1: Newsroom summaries showing different extraction strategies, from time.com, mashable.com, and foxsports.com. Multi-word phrases shared between article and summary are underlined. Novel words used only in the summary are italicized.
  • Figure 2: Example summaries for existing datasets.
  • Figure 3: Procedure to compute the set $\mathcal{F}(A, S)$ of extractive phrases in summary $S$ extracted from article $A$. For each sequential token of the summary, $s_i$, the procedure iterates through tokens of the text, $a_j$. If tokens $s_i$ and $a_j$ match, the longest shared token sequence after $s_i$ and $a_j$ is marked as the extraction starting at $s_i$.
  • Figure 4: Density and coverage distributions across the different domains and existing datasets. Newsroom contains diverse summaries that exhibit a variety of summarization strategies. Each box is a normalized bivariate density plot of extractive fragment coverage (x-axis) and density (y-axis), the two measures of extraction described in Section \ref{section:analysis:dimensions}. The top left corner of each plot shows the number of training set articles $n$ and the median compression ratio $c$ of the articles. For DUC and New York Times, which have no standard data splits, $n$ is the total number of articles. Above, top left to bottom right: Plots for each publication in the Newsroom dataset. We omit TMZ, Economist, and ABC for presentation. Below, left to right: Plots for each summarization dataset showing increasing diversity of summaries along both dimensions of extraction in Newsroom.
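
The procedure in the Figure 3 caption, and the coverage, density, and compression measures plotted in Figure 4, can be illustrated with a minimal Python sketch. This is not the authors' released code. The function names (`extractive_fragments`, `extraction_measures`) and the whitespace tokenization are assumptions made for illustration. The formulas assume the usual definitions: coverage is the fraction of summary tokens that fall inside an extractive fragment, density is the average squared fragment length per summary token, and compression is the article-to-summary length ratio. The inner loop checks every article position, a simple quadratic variant that always finds the longest match starting at $s_i$.

```python
from typing import Dict, List


def extractive_fragments(article: List[str], summary: List[str]) -> List[List[str]]:
    """Greedily collect the shared token sequences F(A, S) (hypothetical helper)."""
    fragments: List[List[str]] = []
    i = 0
    while i < len(summary):
        best = 0  # length of the longest match that starts at summary[i]
        for j in range(len(article)):
            if summary[i] != article[j]:
                continue
            # Extend the match while both sequences keep agreeing.
            k = 0
            while (i + k < len(summary) and j + k < len(article)
                   and summary[i + k] == article[j + k]):
                k += 1
            best = max(best, k)
        if best > 0:
            fragments.append(summary[i:i + best])
        # Skip past the fragment, or past one novel token if nothing matched.
        i += max(best, 1)
    return fragments


def extraction_measures(article: List[str], summary: List[str]) -> Dict[str, float]:
    """Coverage, density, and compression under the assumed definitions."""
    if not summary:
        return {"coverage": 0.0, "density": 0.0, "compression": 0.0}
    frags = extractive_fragments(article, summary)
    n = len(summary)
    return {
        "coverage": sum(len(f) for f in frags) / n,
        "density": sum(len(f) ** 2 for f in frags) / n,
        "compression": len(article) / n,
    }


# Toy example with whitespace tokenization (an assumption, not the paper's tokenizer).
article = "the cat sat on the mat while the dog slept".split()
summary = "the cat slept on the mat".split()
print(extractive_fragments(article, summary))
print(extraction_measures(article, summary))
```

In this toy example the fragments are `the cat`, `slept`, and `on the mat`, so every summary token is covered (coverage 1.0) while density stays low (14/6, about 2.33) because the copied spans are short. A fully extractive summary that copies one long span would score the same coverage but a much higher density, which is why Figure 4 plots the two measures together.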