OARelatedWork: A Large-Scale Dataset of Related Work Sections with Full-texts from Open Access Sources

Martin Docekal; Martin Fajcik; Pavel Smrz

TL;DR

OARelatedWork introduces a large-scale, open-access dataset for generating entire related-work sections from full-text sources, addressing the limitations of abstract-only inputs. The authors assemble 94,450 target papers and 5,824,689 unique cited papers from CORE and Semantic Scholar, enriching the data with bibliography links, an enhanced content hierarchy, and expanded citation spans. They propose BlockMatch, a meta-metric for evaluating long-form summaries, and demonstrate that full-text inputs yield substantial gains across baselines, including PRIMERA and MPT-7b, while open problems remain in evaluation and domain bias. The work highlights practical implications for the automatic generation of cohesive, context-rich related work sections and points to future directions such as retrieval-augmented generation for more scalable, accurate systems.

Abstract

This paper introduces OARelatedWork, the first large-scale multi-document summarization dataset for related work generation containing whole related work sections and full-texts of cited papers. The dataset includes 94,450 papers and 5,824,689 unique referenced papers. It was designed for the task of automatically generating related work, with the aim of shifting the field toward generating entire related work sections from all available content instead of generating parts of related work sections from abstracts only, which is the current mainstream for abstractive approaches. We show that the estimated upper bound for extractive summarization increases by 217% in the ROUGE-2 score when using full content instead of abstracts. Furthermore, we show the benefits of full content data on naive, oracle, traditional, and transformer-based baselines. Long outputs, such as related work sections, pose challenges for automatic evaluation metrics like BERTScore due to their limited input length. We tackle this issue by proposing and evaluating a meta-metric built on BERTScore. Despite operating on smaller blocks, this meta-metric correlates with human judgment comparably to the original BERTScore.
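The idea of scoring long outputs on smaller blocks can be illustrated with a minimal sketch. It is not the paper's BlockMatch definition, which this page does not spell out. It assumes that each text is already split into blocks, that every block is matched greedily to its most similar block on the other side, and that the two directions are combined as a harmonic mean. The names `token_f1` and `block_meta_metric` are hypothetical, and the token-overlap similarity is only a self-contained stand-in for BERTScore.

```python
from collections import Counter
from statistics import mean
from typing import Callable, List


def token_f1(a: str, b: str) -> float:
    """Token-overlap F1 between two blocks (stand-in for BERTScore; an assumption)."""
    ta, tb = Counter(a.lower().split()), Counter(b.lower().split())
    overlap = sum((ta & tb).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(ta.values())
    recall = overlap / sum(tb.values())
    return 2 * precision * recall / (precision + recall)


def block_meta_metric(
    candidate_blocks: List[str],
    reference_blocks: List[str],
    sim: Callable[[str, str], float] = token_f1,
) -> float:
    """Aggregate block-level similarities into one score (hypothetical scheme)."""
    if not candidate_blocks or not reference_blocks:
        return 0.0
    # Precision: each candidate block is matched to its best reference block.
    precision = mean(max(sim(c, r) for r in reference_blocks) for c in candidate_blocks)
    # Recall: each reference block is matched to its best candidate block.
    recall = mean(max(sim(c, r) for c in candidate_blocks) for r in reference_blocks)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    generated = ["prior work uses abstracts only", "we use full texts"]
    reference = ["we use full texts", "earlier work relies on abstracts"]
    print(round(block_meta_metric(generated, reference), 3))
```

Because the similarity function only ever sees one pair of blocks at a time, any length-limited scorer could be passed as `sim` without truncating the whole section.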

Paper Structure (28 sections, 4 equations, 6 figures, 2 tables)

Introduction
OARelatedWork Dataset
Corpus Processing
Bibliography Linking
Content Hierarchy
Citation Spans
Document Content Cleaning
Related Work Dataset
Domain Shift
Tasks Definition
Evaluation
BlockMatch
Evaluation of BlockMatch Metric
Citation Metric
Related Work
...and 13 more sections

Figures (6)

Figure 1: The task is to generate a whole related work section from cited papers and the rest of the target paper. We also try variants using abstracts instead of full-texts.
Figure 2: An example of hierarchy parsing showing two major steps: Internal Numbering and creation of Anchor. Orange boxes contain notes on obstacles for the parser. Note that we are able to guess section numbers for headlines with missing numbers by exploiting writing habits and format. The introduction is usually the first section, and we can see that the capital-letter format is used for all (known) top-level headlines. The headline * H4 23 y 6 will be removed in the cleaning step.
Figure 3: Field of study domain shift between all papers in the corpus used and the dataset splits. As a single paper may have multiple fields of study, the counts do not add up to the total number of papers. The tpm is task (project management).
Figure 4: Violin plot showing the 95% confidence interval of Pearson correlations of automatic metrics with human judgment. Coarse is a human judgment obtained by asking the evaluator about the faithfulness of the whole summary, whereas fine was obtained by asking about smaller sub-parts. The middle lines are medians.
Figure 5: Histograms showing lengths of input types and lengths of outputs on the train set. These numbers are obtained from PRIMERA tokenized inputs/outputs. GO for the target paper is not created from the abstract. GO summaries of cited documents are created for each document separately. We cut the right part of the histograms to fit them on the page, so the maximum is not shown.
...and 1 more figure