PMOA-TTS: Introducing the PubMed Open Access Textual Times Series Corpus

Shahriar Noroozizadeh, Sayantan Kumar, George H. Chen, Jeremy C. Weiss

TL;DR

PMOA-TTS tackles the scarcity of large-scale temporally annotated clinical narratives by introducing a 124,699-case corpus of PubMed Open Access single-patient case reports converted into structured textual timelines with over 5.6 million timestamped events. The authors present an end-to-end LLM-driven pipeline for case-report identification, timeline extraction, demographic/diagnosis enrichment, and rigorous evaluation against a clinician-curated gold standard using event-match, temporal-concordance, and timestamp-discrepancy metrics. They demonstrate robust data properties, disease coverage, and temporal structure, and show the dataset’s utility through a downstream survival-analysis task that leverages textual timelines. The work provides open access data and code, enabling research on timeline reconstruction, temporal reasoning, and longitudinal outcome prediction while highlighting limitations and directions for future ontology-aware normalization and broader EHR-type applicability. Overall, PMOA-TTS advances temporal clinical NLP by delivering a scalable, reproducible resource that supports timeline-based reasoning and predictive modeling from narrative text.

Abstract

Clinical narratives encode temporal dynamics essential for modeling patient trajectories, yet large-scale temporally annotated resources are scarce. We introduce PMOA-TTS, a corpus of 124,699 single-patient PubMed Open Access case reports converted into structured textual timelines of (event, time) pairs using a scalable large-language-model pipeline (Llama 3.3 70B and DeepSeek-R1). The corpus comprises over 5.6 million timestamped events, alongside extracted demographics and diagnoses. Technical validation uses a clinician-curated gold set and three measures: semantic event matching, temporal concordance (c-index), and alignment error summarized with Area Under the Log-Time CDF (AULTC). We benchmark alternative prompting and model choices and provide documentation to support reproduction. PMOA-TTS enables research on timeline extraction, temporal reasoning, survival modeling and event forecasting from narrative text, and offers broad diagnostic and demographic coverage. Data and code are openly available in public repositories.
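To make the temporal concordance measure concrete, the sketch below computes a generic pairwise concordance index between a reference timeline and an extracted timeline. It is illustrative only: the abstract does not give the exact definition used for the corpus, so the tie handling, the assumption that events are already matched one-to-one, the use of hours as the time unit, and the function name `concordance_index` are all assumptions made here, not details from the paper.

```python
from itertools import combinations


def concordance_index(reference_times, predicted_times):
    """Fraction of comparable event pairs whose temporal order agrees.

    Assumes both lists are aligned: position k in each list refers to
    the same matched event (an assumption of this sketch).
    """
    if len(reference_times) != len(predicted_times):
        raise ValueError("timelines must contain the same matched events")

    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(reference_times)), 2):
        ref_diff = reference_times[i] - reference_times[j]
        if ref_diff == 0:
            continue  # tied in the reference: the pair carries no ordering
        comparable += 1
        pred_diff = predicted_times[i] - predicted_times[j]
        if pred_diff == 0:
            concordant += 0.5  # predicted tie counts as half agreement
        elif (ref_diff > 0) == (pred_diff > 0):
            concordant += 1.0
    if comparable == 0:
        return float("nan")
    return concordant / comparable


# Hypothetical (event, time) pairs, with time in hours relative to admission.
reference = [("fever", -72), ("admitted", 0), ("CT scan", 24), ("discharged", 168)]
extracted = [("fever", -48), ("admitted", 0), ("CT scan", 0), ("discharged", 120)]

score = concordance_index(
    [t for _, t in reference],
    [t for _, t in extracted],
)
print(f"c-index: {score:.3f}")
```

A score of 1.0 means every comparable pair of events appears in the same order in both timelines, and 0.5 is what a random ordering would yield on average. In the example, only the admitted/CT scan pair is affected, because the extracted timeline assigns both events the same time.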

Paper Structure

This paper contains 46 sections, 5 equations, 14 figures, 7 tables, 1 algorithm.

Figures (14)

  • Figure 1: Example case report (left) with text-ordered event-time tuples (right).
  • Figure 2: Flowchart of the extraction and annotation pipeline. The left panel (Extraction) shows how we filtered the PMOA corpus to identify single-patient case reports. The middle panel (Annotation) depicts the generation of textual time series for each case via LLM prompting and the creation of evaluation subsets. The right panel (Assessment) summarizes the evaluation process, including metadata comparison, text event matching, time discrepancy analysis (log-time CDF, AULTC), and temporal order concordance.
  • Figure 3: Top: PMOA-TTS record schema. Bottom: compact example matching the schema above, drawn from our HuggingFace repository (formatted for readability).
  • Figure 4: Age, sex, and ethnicity of 124,699 PMOA case reports.
  • Figure 6: Frequency, co-occurrence, and prevalence patterns of UMLS-normalized diagnoses in PMOA-TTS. (a) The 20 most frequently mentioned diagnoses using canonical UMLS names. (b) Pairwise diagnosis co-occurrence (log-transformed counts) with hierarchical clustering, highlighting groups of frequently co-mentioned conditions. (c) Prevalence of coarse disease groups in PMOA-TTS compared with published U.S. adult baseline estimates, illustrating systematic differences between case-report-derived narratives and general-population distributions.
  • ...and 9 more figures