PreprintToPaper dataset: connecting bioRxiv preprints with journal publications

Fidan Badalova, Julian Sienkiewicz, Philipp Mayr

TL;DR

The PreprintToPaper dataset connects bioRxiv preprints with their corresponding journal publications, enabling large-scale analysis of the preprint-to-publication process. It supports studies of scholarly communication and open science policies, bibliometric tool development, and natural language processing research on textual changes between preprints and their published versions.

Abstract

The PreprintToPaper dataset connects bioRxiv preprints with their corresponding journal publications, enabling large-scale analysis of the preprint-to-publication process. It comprises metadata for 145,517 preprints from two periods, 2016-2018 (pre-pandemic) and 2020-2022 (pandemic), retrieved via the bioRxiv and Crossref APIs. Each record includes bibliographic information such as titles, abstracts, authors, institutions, submission dates, licenses, and subject categories, alongside enriched publication metadata including journal names, publication dates, author lists, and further information. Preprints are categorized into three groups: Published (formally linked to a journal article), Preprint Only (unpublished), and Gray Zone (potentially published but unlinked). To enhance reliability, title and author similarity scores were calculated, and a human-annotated subset of 299 records was created for evaluation of Gray Zone cases. The dataset supports diverse applications, including studies of scholarly communication, open science policies, bibliometric tool development, and natural language processing research on textual changes between preprints and their published versions. The dataset is publicly available in CSV format via Zenodo.

Paper Structure

This paper contains 16 sections, 1 equation, 3 figures, 3 tables.

Figures (3)

  • Figure 1: PreprintToPaper dataset creation workflow. In Step 5 (Gray Zone verification), a human-annotated subset of 299 rows was created, limited to cases with a title match score of 0.75 and covering both periods (2016–2018 and 2020–2022). Two annotators manually checked the corresponding abstracts.
  • Figure 2: Distribution of published papers by title match score
  • Figure 3: (a) Combined annotations versus author match score for records with a title match score equal to 0.75. Symbols are data points, and the curve is the fit of the logistic regression model (Eq. 1). Data points are artificially jittered around the TRUE and FALSE values to avoid overlap. (b) The fraction of preprints that were eventually published as journal articles versus the year of preprint submission. Triangles represent data without the Gray Zone records; circles include Gray Zone records, with an author match score larger than 0.47 used to distinguish preprints from eventually published journal papers.
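
The Figure 3 caption pairs a logistic regression fit with an author match score cut-off. The following is a minimal sketch of how a cut-off could be read off a one-variable logistic model, under the assumption (not stated in the source) that the cut-off is the score at which the fitted probability crosses 0.5. The function names `fit_logistic_1d` and `decision_threshold` are hypothetical, and the scores and labels are invented toy values, not records from the dataset.

```python
import math


def fit_logistic_1d(x, y, lr=0.5, steps=5000):
    """Fit p(y=1 | x) = 1 / (1 + exp(-(b0 + b1 * x))) by batch gradient descent."""
    b0, b1 = 0.0, 0.0
    n = len(x)
    for _ in range(steps):
        g0 = 0.0
        g1 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            g0 += p - yi           # gradient of the log-loss w.r.t. b0
            g1 += (p - yi) * xi    # gradient of the log-loss w.r.t. b1
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return b0, b1


def decision_threshold(b0, b1):
    """Score at which the fitted probability equals 0.5 (b0 + b1 * x = 0)."""
    return -b0 / b1


# Invented toy data: author match scores and TRUE/FALSE annotations (1/0).
# The classes overlap on purpose so the maximum-likelihood fit is finite.
toy_scores = [0.05, 0.10, 0.20, 0.30, 0.40, 0.45, 0.55, 0.60, 0.70, 0.80, 0.90, 1.00]
toy_labels = [0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1]

b0, b1 = fit_logistic_1d(toy_scores, toy_labels)
threshold = decision_threshold(b0, b1)
print(f"b0={b0:.3f}, b1={b1:.3f}, threshold={threshold:.3f}")
```

With real annotations, records whose author match score exceeds the resulting threshold would be treated as likely published. The value obtained from the toy data has no relation to the 0.47 reported in the caption.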