PMOA-TTS: Introducing the PubMed Open Access Textual Times Series Corpus

Shahriar Noroozizadeh; Sayantan Kumar; George H. Chen; Jeremy C. Weiss

TL;DR

PMOA-TTS addresses the scarcity of large-scale temporally annotated clinical narratives by introducing a corpus of 124,699 single-patient case reports from PubMed Open Access, each converted into a structured textual timeline, for a total of over 5.6 million timestamped events. The authors present an end-to-end LLM-driven pipeline for case-report identification, timeline extraction, and demographic/diagnosis enrichment, and evaluate it against a clinician-curated gold standard using event-match, temporal-concordance, and timestamp-discrepancy metrics. They characterize the corpus's data properties, disease coverage, and temporal structure, and show its utility through a downstream survival-analysis task that uses the textual timelines as input. Data and code are openly available, enabling research on timeline reconstruction, temporal reasoning, and longitudinal outcome prediction. The authors also note limitations and point to future work on ontology-aware normalization and broader applicability to EHR-style text. Overall, PMOA-TTS advances temporal clinical NLP by providing a scalable, reproducible resource for timeline-based reasoning and predictive modeling from narrative text.

Abstract

Clinical narratives encode temporal dynamics essential for modeling patient trajectories, yet large-scale temporally annotated resources are scarce. We introduce PMOA-TTS, a corpus of 124,699 single-patient PubMed Open Access case reports converted into structured textual timelines of (event, time) pairs using a scalable large-language-model pipeline (Llama 3.3 70B and DeepSeek-R1). The corpus comprises over 5.6 million timestamped events, alongside extracted demographics and diagnoses. Technical validation uses a clinician-curated gold set and three measures: semantic event matching, temporal concordance (c-index), and alignment error summarized with Area Under the Log-Time CDF (AULTC). We benchmark alternative prompting and model choices and provide documentation to support reproduction. PMOA-TTS enables research on timeline extraction, temporal reasoning, survival modeling and event forecasting from narrative text, and offers broad diagnostic and demographic coverage. Data and code are openly available in public repositories.
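To make the (event, time) representation and the temporal-concordance measure concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the dict-based timeline format, the hour units, the example events, and the function name `temporal_concordance` are all hypothetical, and exact string matching stands in for the semantic event matching mentioned in the abstract. The tie handling (skip pairs tied in the reference, score 0.5 for pairs tied in the prediction) follows the common c-index convention and may differ from the paper's definition.

```python
from itertools import combinations


def temporal_concordance(reference, predicted):
    """Fraction of comparable event pairs ordered the same way in both timelines.

    reference, predicted: dicts mapping event text -> time in hours
    (a hypothetical format chosen for this sketch).
    Pairs tied in the reference are skipped; pairs tied in the
    prediction contribute 0.5, as in the usual c-index convention.
    """
    # Exact-string matching is an assumption; it replaces semantic matching.
    shared = [event for event in reference if event in predicted]
    concordant = 0.0
    comparable = 0
    for a, b in combinations(shared, 2):
        ref_diff = reference[a] - reference[b]
        if ref_diff == 0:
            continue  # no defined order in the reference
        comparable += 1
        pred_diff = predicted[a] - predicted[b]
        if pred_diff == 0:
            concordant += 0.5
        elif (ref_diff > 0) == (pred_diff > 0):
            concordant += 1.0
    return concordant / comparable if comparable else float("nan")


# Made-up timelines: times are hours relative to admission (time 0).
reference = {"chest pain": -24.0, "admission": 0.0,
             "angiography": 6.0, "discharge": 72.0}
predicted = {"chest pain": -12.0, "admission": 0.0,
             "angiography": 48.0, "discharge": 24.0}

score = temporal_concordance(reference, predicted)
print(f"temporal concordance: {score:.3f}")
```

In this toy example only the angiography/discharge pair is ordered differently in the two timelines, so five of the six comparable pairs are concordant.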
