Newsroom: A Dataset of 1.3 Million Summaries with Diverse Extractive Strategies
Max Grusky, Mor Naaman, Yoav Artzi
TL;DR
Newsroom addresses the data bottleneck in single-document summarization by releasing a dataset of 1.3 million article–summary pairs drawn from 38 news publishers and spanning 1998 to 2017. It introduces three measures (Coverage, Density, and Compression) that quantify how extractive or abstractive a summary is, and uses them to analyze how summaries vary across outlets, topics, and time. The paper evaluates baseline extractive, abstractive, and mixed methods, complemented by a human evaluation protocol. The results show that Newsroom's diversity of summarization styles is challenging for these methods and suggest potential for improving cross-domain performance. The dataset and evaluation framework are publicly available, supporting reproducibility and data-intensive learning for news summarization.
Abstract
We present NEWSROOM, a summarization dataset of 1.3 million articles and summaries written by authors and editors in newsrooms of 38 major news publications. Extracted from search and social media metadata between 1998 and 2017, these high-quality summaries demonstrate high diversity of summarization styles. In particular, the summaries combine abstractive and extractive strategies, borrowing words and phrases from articles at varying rates. We analyze the extraction strategies used in NEWSROOM summaries against other datasets to quantify the diversity and difficulty of our new data, and train existing methods on the data to evaluate its utility and challenges.
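The rate at which a summary borrows words and phrases from its article can be made concrete with the Coverage, Density, and Compression measures named in the TL;DR. The sketch below is illustrative only. This section does not spell out the definitions, so it assumes the following: shared token sequences ("extractive fragments") are found greedily, Coverage is the fraction of summary tokens that fall inside a fragment, Density is the sum of squared fragment lengths divided by summary length, and Compression is article length divided by summary length. The function names and the lowercase whitespace tokenization are hypothetical choices, not the authors' released code.

```python
def extractive_fragments(article, summary):
    """Greedily collect token sequences that the summary shares with the article.

    Assumed procedure: at each summary position, take the longest run of
    tokens that also appears contiguously somewhere in the article.
    """
    fragments = []
    i = 0
    while i < len(summary):
        best = 0  # length of the longest shared run starting at summary[i]
        j = 0
        while j < len(article):
            if summary[i] == article[j]:
                k = 0
                while (i + k < len(summary) and j + k < len(article)
                       and summary[i + k] == article[j + k]):
                    k += 1
                best = max(best, k)
                j += k
            else:
                j += 1
        if best > 0:
            fragments.append(summary[i:i + best])
        i += max(best, 1)  # skip past the fragment, or past one unmatched token
    return fragments


def extractiveness(article, summary):
    """Return (coverage, density, compression) under the assumed definitions."""
    if not summary:
        raise ValueError("summary must contain at least one token")
    fragments = extractive_fragments(article, summary)
    n = len(summary)
    coverage = sum(len(f) for f in fragments) / n
    density = sum(len(f) ** 2 for f in fragments) / n
    compression = len(article) / n
    return coverage, density, compression


# Toy example with lowercase whitespace tokenization (an assumption).
article_tokens = "the quick brown fox jumps over the lazy dog".lower().split()
summary_tokens = "a quick brown fox sleeps".lower().split()
print(extractiveness(article_tokens, summary_tokens))
```

In this toy example the only shared fragment is "quick brown fox", so three of the five summary tokens are covered (Coverage 0.6), Density is 9/5 = 1.8, and Compression is 9/5 = 1.8. Under these assumed definitions, a fully copied summary scores Coverage 1.0 with high Density, while a summary that reuses only scattered single words can have high Coverage but Density near 1 or below.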
