CEHR-GPT: Generating Electronic Health Records with Chronological Patient Timelines

Chao Pang, Xinzhuo Jiang, Nishanth Parameshwar Pavinkurve, Krishna S. Kalluri, Elise L. Minto, Jason Patterson, Linying Zhang, George Hripcsak, Gamze Gürsoy, Noémie Elhadad, Karthik Natarajan

TL;DR

CEHR-GPT presents a transformer-based approach for generating time-series electronic health records by encoding OMOP data into a temporally rich patient representation and training a GPT to model patient sequences. The method preserves temporal structure through dedicated tokens (ATT, IATT, VS/VE) and enables near-lossless conversion back to OMOP via an OMOP encoder/decoder. Comprehensive evaluation across data utility, predictive performance, and privacy shows that CEHR-GPT produces realistic synthetic cohorts with strong alignment to real data while maintaining low privacy risk, aided by careful sampling and representation choices. The approach offers a scalable, standards-aligned pathway for disseminating synthetic EHR data suitable for model development, benchmarking, and external validation, with potential extensions to time-sensitive forecasting and broader data models.

Abstract

Synthetic Electronic Health Records (EHR) have emerged as a pivotal tool in advancing healthcare applications and machine learning models, particularly for researchers without direct access to healthcare data. Although existing methods, like rule-based approaches and generative adversarial networks (GANs), generate synthetic data that resembles real-world EHR data, these methods often use a tabular format, disregarding temporal dependencies in patient histories and limiting data replication. Recently, there has been a growing interest in leveraging Generative Pre-trained Transformers (GPT) for EHR data. This enables applications like disease progression analysis, population estimation, counterfactual reasoning, and synthetic data generation. In this work, we focus on synthetic data generation and demonstrate the capability of training a GPT model using a particular patient representation derived from CEHR-BERT, enabling us to generate patient sequences that can be seamlessly converted to the Observational Medical Outcomes Partnership (OMOP) data format.

Paper Structure (33 sections, 12 equations, 12 figures, 10 tables)

Figures (12)

  • Figure 1: The OMOP data is first converted to patient sequences by an OMOP encoder based on the patient representation that preserves demographics, visit types, and temporal intervals between visits. Then a generative model is trained to learn the sequence distribution to generate new sequences. Next, the generated sequences are converted back to the OMOP format using an OMOP decoder.
  • Figure 2: The patient representation preserves demographics, visit types, temporal intervals between visits, and inpatient duration. It begins with a demographic prompt (year at the first visit, age at the first visit, gender, and race tokens), followed by a series of visit blocks that represent the complete patient timeline. An artificial time token (ATT) is inserted between neighboring visit blocks to track the time interval in days. Each visit block retains all essential information, including the visit type and domain records. For inpatient visits, inpatient ATT tokens (representing time intervals in days) are inserted between groups of concepts that occur on the same day, and a discharge token is added at the end of the visit block.
  • Figure 3: KL divergence comparing the concept probability distributions of synthetic and real data. Concept probabilities were calculated over the entire population.
  • Figure 4: Concept prevalence in the source OMOP versus the generated OMOP using top $p=95\%$, on a log scale, stratified by domain (columns) and by population (rows). The x-axis and y-axis represent the source and synthetic data, respectively, and each dot represents a concept.
  • Figure 5: KL divergence associated with different synthetic data. The closer to the lower bound in the bottom left corner, the better the synthetic data.
  • ...and 7 more figures
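
The patient representation described in the Figure 2 caption can be illustrated with a minimal Python sketch. It is not the authors' OMOP encoder. The `Visit` class, the `encode_patient` function, the token spellings (`year:`, `age:`, `D<n>` for an ATT, `i-D<n>` for an inpatient ATT, `VS`/`VE` for visit start and end), and the example concepts are all assumptions made for illustration. The sketch also assumes that the interval between visits is measured from the last record date of the previous visit.

```python
from dataclasses import dataclass, field
from datetime import date
from itertools import groupby
from typing import List, Optional, Tuple


@dataclass
class Visit:
    """Hypothetical container for one visit; not part of any OMOP library."""
    visit_type: str                              # e.g. "outpatient" or "inpatient"
    start: date
    records: List[Tuple[date, str]] = field(default_factory=list)  # (day, concept)
    discharge: Optional[str] = None              # used only for inpatient visits


def encode_patient(year, age, gender, race, visits):
    """Flatten one patient's visits into a token sequence (illustrative only)."""
    # Demographic prompt: year and age at the first visit, then gender and race.
    tokens = [f"year:{year}", f"age:{age}", gender, race]
    prev_end = None
    for visit in sorted(visits, key=lambda v: v.start):
        records = sorted(visit.records, key=lambda r: r[0])
        if prev_end is not None:
            # ATT between neighboring visit blocks: interval in days (assumed spelling).
            tokens.append(f"D{(visit.start - prev_end).days}")
        tokens += ["VS", visit.visit_type]
        prev_day = None
        for day, group in groupby(records, key=lambda r: r[0]):
            if visit.visit_type == "inpatient" and prev_day is not None:
                # Inpatient ATT between groups of same-day concepts.
                tokens.append(f"i-D{(day - prev_day).days}")
            tokens += [concept for _, concept in group]
            prev_day = day
        if visit.visit_type == "inpatient" and visit.discharge:
            # Discharge token closes an inpatient visit block.
            tokens.append(visit.discharge)
        tokens.append("VE")
        # Assumption: the next interval is measured from the last record date.
        prev_end = records[-1][0] if records else visit.start
    return tokens


visits = [
    Visit("outpatient", date(2020, 1, 1), [(date(2020, 1, 1), "hypertension")]),
    Visit(
        "inpatient",
        date(2020, 1, 11),
        [(date(2020, 1, 11), "chest_pain"), (date(2020, 1, 13), "ecg")],
        discharge="discharge_home",
    ),
]
tokens = encode_patient(2020, 54, "gender:F", "race:unknown", visits)
print(tokens)
```

Because every interval is stored as an explicit token, a decoder can in principle walk the sequence and rebuild visit dates from the starting year, which is the property the caption attributes to the representation.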