SciNews: From Scholarly Complexities to Public Narratives -- A Dataset for Scientific News Report Generation

Dongqi Liu, Yifan Wang, Jia Loy, Vera Demberg

TL;DR

SciNews tackles the problem of translating scholarly articles into accessible scientific news by introducing a large, multidisciplinary parallel corpus of 41,872 paper–news pairs across nine domains. The authors formalize automated scientific news report generation as learning $P(Y|X)$ over pairs $(x_i,y_i)$, and they build a high-quality dataset from Science X with careful cleaning, quality control, and a train/validation/test split. Through extensive experiments with extractive and abstractive baselines (notably Longformer-, RSTformer-, SIMSUM-, Vicuna-, and GPT-4-based approaches), they show that abstractive, discourse-aware models achieve stronger performance on several automatic metrics, while human evaluations reveal ongoing challenges with faithfulness and coherence. The work demonstrates both the practical potential of SciNews for advancing narrative scientific communication and the remaining research gaps in producing faithful, readable long-form science news, with implications for downstream tasks such as topic classification and headline generation.
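
One way to make the $P(Y|X)$ formulation concrete is a maximum-likelihood objective over the paper–news pairs $(x_i,y_i)$. This is a sketch under assumptions, not the authors' stated training objective: the parameters $\theta$, the number of training pairs $N$, and the token index $t$ are illustrative notation that does not appear in the summary above, and the autoregressive factorization applies only to sequence-to-sequence style abstractive models.

$$
\hat{\theta} = \arg\max_{\theta} \sum_{i=1}^{N} \log P_{\theta}(y_i \mid x_i),
\qquad
\log P_{\theta}(y_i \mid x_i) = \sum_{t=1}^{|y_i|} \log P_{\theta}\left(y_{i,t} \mid y_{i,<t},\, x_i\right)
$$

Under this reading, $x_i$ is the full academic paper, $y_i$ is its news report, and each token of the report is predicted from the paper together with the tokens generated so far.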

Abstract

Scientific news reports serve as a bridge, adeptly translating complex research articles into reports that resonate with the broader public. The automated generation of such narratives enhances the accessibility of scholarly insights. In this paper, we present a new corpus to facilitate this paradigm development. Our corpus comprises a parallel compilation of academic publications and their corresponding scientific news reports across nine disciplines. To demonstrate the utility and reliability of our dataset, we conduct an extensive analysis, highlighting the divergences in readability and brevity between scientific news narratives and academic manuscripts. We benchmark our dataset employing state-of-the-art text generation models. The evaluation process involves both automatic and human evaluation, which lays the groundwork for future explorations into the automated generation of scientific news reports. The dataset and code related to this work are available at https://dongqi.me/projects/SciNews.

Paper Structure

This paper contains 29 sections, 5 figures, and 7 tables.

Figures (5)

  • Figure 1: An example of an academic paper paired with its news report.
  • Figure 2: Topic distribution of our dataset.
  • Figure 3: Absolute differences of proportion in linguistic structures (academic papers$-$news articles).
  • Figure 4: Consistency check.
  • Figure :