PubMed 200k RCT: a Dataset for Sequential Sentence Classification in Medical Abstracts

Franck Dernoncourt, Ji Young Lee

TL;DR

The paper introduces PubMed 200k RCT, the largest publicly available dataset for sequential sentence classification in medical abstracts, focusing on randomized controlled trials (RCTs). It details how the dataset was constructed from the PubMed Baseline using MeSH-based RCT filtering and abstract-structure criteria, yielding 195,654 abstracts, together with a smaller 20k variant. It describes the data format, a three-way train/validation/test split, and baseline benchmarks (LR, Forward ANN, CRF, bi-ANN) to enable direct comparisons. The work highlights the dataset's potential to improve tools for information extraction and efficient literature review in medicine.

Abstract

We present PubMed 200k RCT, a new dataset based on PubMed for sequential sentence classification. The dataset consists of approximately 200,000 abstracts of randomized controlled trials, totaling 2.3 million sentences. Each sentence of each abstract is labeled with its role in the abstract using one of the following classes: background, objective, method, result, or conclusion. The purpose of releasing this dataset is twofold. First, the majority of datasets for sequential short-text classification (i.e., classification of short texts that appear in sequences) are small: we hope that releasing a new large dataset will help develop more accurate algorithms for this task. Second, from an application perspective, researchers need better tools to efficiently skim through the literature. Automatically classifying each sentence in an abstract would help researchers read abstracts more efficiently, especially in fields where abstracts may be long, such as the medical field.

Paper Structure

This paper contains 9 sections, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Number of RCTs present in PubMed published yearly between 1960 and 2014 (inclusive). The first documented controlled trial dates back to 1747 [dunn1997james], but the scientific value of RCTs became widely recognized only by the late 20th century as the standard method for medical evidence [meldrum2000brief].
  • Figure 2: Evolution of the percentage of RCT abstracts present in PubMed that are unstructured between 1975 and 2014 (inclusive). The years before 1975 were omitted due to the low number of RCTs. Overall, approximately half of the RCT abstracts are unstructured. An RCT abstract is considered unstructured if and only if at least one of its sections is labeled as "None".
  • Figure 3: Example of an abstract with the method section highlighted. Abstracts in the medical field can be long. This abstract was taken from [krogh2016ultrasound], and several sentences have been removed for the sake of conciseness. Providing clinical researchers and practitioners with a tool that would allow them to highlight the section(s) they are interested in would help them explore the literature more efficiently.
  • Figure 4: Number of sentences per label
  • Figure 5: Distribution of the number of tokens per sentence. Minimum: 1; mean: 26.2; maximum: 338; variance: 227.6; skewness: 2.0; kurtosis: 8.7.
  • ...and 1 more figure
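
The summary statistics quoted in the Figure 5 caption (minimum, mean, maximum, variance, skewness, kurtosis) can be computed from per-sentence token counts. The following is a minimal sketch, not the authors' code. It rests on several assumptions: tokens are obtained by whitespace splitting, the moments are population moments, and kurtosis is reported as the non-excess value (fourth standardized moment). The caption does not say which conventions were used. The function name `token_count_stats` and the sample sentences are hypothetical.

```python
import math
from statistics import fmean


def token_count_stats(sentences):
    """Summarize the distribution of tokens per sentence.

    Assumptions (not taken from the paper): whitespace tokenization,
    population moments, and non-excess kurtosis.
    """
    counts = [len(s.split()) for s in sentences]
    if not counts:
        raise ValueError("need at least one sentence")

    n = len(counts)
    mean = fmean(counts)
    # Central moments of order 2, 3 and 4 (population form, divide by n).
    m2 = sum((c - mean) ** 2 for c in counts) / n
    m3 = sum((c - mean) ** 3 for c in counts) / n
    m4 = sum((c - mean) ** 4 for c in counts) / n

    # Skewness and kurtosis are undefined when all counts are equal.
    skewness = m3 / m2 ** 1.5 if m2 > 0 else math.nan
    kurtosis = m4 / m2 ** 2 if m2 > 0 else math.nan

    return {
        "min": min(counts),
        "mean": mean,
        "max": max(counts),
        "variance": m2,
        "skewness": skewness,
        "kurtosis": kurtosis,
    }


if __name__ == "__main__":
    # Hypothetical sentences, for illustration only.
    sample = [
        "Patients were randomized into two groups .",
        "The primary outcome was pain at 12 weeks .",
        "No significant difference was observed .",
    ]
    for name, value in token_count_stats(sample).items():
        print(f"{name}: {value}")
```

A different tokenizer, or a library that reports sample variance or excess kurtosis, would give slightly different numbers from the same sentences, so the conventions should be fixed before comparing against the caption.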