SciNews: From Scholarly Complexities to Public Narratives -- A Dataset for Scientific News Report Generation

Dongqi Liu; Yifan Wang; Jia Loy; Vera Demberg

TL;DR

SciNews tackles the problem of translating scholarly articles into accessible scientific news by introducing a large, multidisciplinary parallel corpus of 41,872 paper–news pairs across nine domains. The authors formalize automated scientific news report generation as learning $P(Y|X)$ over pairs $(x_i,y_i)$, and they build a high-quality dataset from Science X with careful cleaning, quality control, and a train/validation/test split. Through extensive experiments with extractive and abstractive baselines (notably Longformer-, RSTformer-, SIMSUM-, Vicuna-, and GPT-4-based approaches), they show that abstractive, discourse-aware models achieve stronger performance on several automatic metrics, while human evaluations reveal ongoing challenges with faithfulness and coherence. The work demonstrates both the practical potential of SciNews for advancing narrative scientific communication and the remaining research gaps in producing faithful, readable long-form science news, with implications for downstream tasks such as topic classification and headline generation.
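The task formulation mentioned above can be written out as a short sketch. It assumes a maximum-likelihood objective with an autoregressive factorization, which is the usual setup for abstractive generators but is not spelled out in this summary. The symbols $\theta$ (model parameters), $N$ (number of training pairs), and $y_{i,t}$ (the $t$-th token of report $y_i$) are introduced here for illustration only.

```latex
% Parallel corpus of papers x_i and their news reports y_i
\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}

% Assumed training objective: maximize the conditional log-likelihood
\hat{\theta} = \arg\max_{\theta} \sum_{i=1}^{N} \log P_{\theta}(y_i \mid x_i)

% Assumed autoregressive factorization over the tokens of the report
P_{\theta}(y_i \mid x_i) = \prod_{t=1}^{|y_i|} P_{\theta}\left(y_{i,t} \mid y_{i,<t},\, x_i\right)
```

Under this reading, $P(Y|X)$ in the summary corresponds to $P_{\theta}(y_i \mid x_i)$ evaluated on each pair. Extractive baselines select sentences from $x_i$ rather than generating tokens, so the factorization would apply only to the abstractive models.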

Abstract

Scientific news reports serve as a bridge, adeptly translating complex research articles into reports that resonate with the broader public. The automated generation of such narratives enhances the accessibility of scholarly insights. In this paper, we present a new corpus to facilitate this paradigm development. Our corpus comprises a parallel compilation of academic publications and their corresponding scientific news reports across nine disciplines. To demonstrate the utility and reliability of our dataset, we conduct an extensive analysis, highlighting the divergences in readability and brevity between scientific news narratives and academic manuscripts. We benchmark our dataset employing state-of-the-art text generation models. The evaluation process involves both automatic and human evaluation, which lays the groundwork for future explorations into the automated generation of scientific news reports. The dataset and code related to this work are available at https://dongqi.me/projects/SciNews.

Paper Structure (29 sections, 5 figures, 7 tables)

Introduction
Related Work
Scientific Lay Summarization
Scientific Text Simplification
The SciNews Dataset
Task Formulation
Data Acquisition
Data Cleaning
Quality Control
Data Splits
Dataset Analysis
Dataset Comparison
Dataset Statistics
Papers vs. News
Experiments
...and 14 more sections

Figures (5)

Figure 1: An example of an academic paper paired with its news report.
Figure 2: Topic distribution of our dataset.
Figure 3: Absolute differences of proportion in linguistic structures (academic papers $-$ news articles).
Figure 4: Consistency check.
