CEHR-GPT: Generating Electronic Health Records with Chronological Patient Timelines
Chao Pang, Xinzhuo Jiang, Nishanth Parameshwar Pavinkurve, Krishna S. Kalluri, Elise L. Minto, Jason Patterson, Linying Zhang, George Hripcsak, Gamze Gürsoy, Noémie Elhadad, Karthik Natarajan
TL;DR
CEHR-GPT is a transformer-based approach for generating time-series electronic health records: it encodes OMOP data into a temporally rich patient representation and trains a GPT model on the resulting patient sequences. The method preserves temporal structure through dedicated tokens (artificial time tokens, ATT; inpatient artificial time tokens, IATT; and visit start/end markers, VS/VE) and enables near-lossless conversion back to OMOP via an OMOP encoder/decoder. Evaluation across data utility, predictive performance, and privacy shows that CEHR-GPT produces realistic synthetic cohorts that align closely with real data while maintaining low privacy risk, aided by careful sampling and representation choices. The approach offers a scalable, standards-aligned pathway for disseminating synthetic EHR data suitable for model development, benchmarking, and external validation, with potential extensions to time-sensitive forecasting and broader data models.
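To make the idea of a temporally rich sequence concrete, the following is a minimal sketch of a round trip between dated visits and a flat token sequence. It is not the authors' encoder/decoder. It assumes that VS and VE mark the start and end of a visit, that the gap between consecutive visits is written as a hypothetical day-count token of the form `D<n>` (standing in for an ATT), and that gaps are measured between visit start dates. The `Visit` class, the function names, and the concept labels are illustrative only, and IATT tokens are omitted.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Visit:
    """Hypothetical stand-in for one visit and its recorded concepts."""
    start: date
    concepts: list[str]


def encode(visits: list[Visit]) -> list[str]:
    """Flatten visits into tokens: VS, concepts, VE, with D<n> gap tokens between visits."""
    tokens: list[str] = []
    previous_start = None
    for visit in sorted(visits, key=lambda v: v.start):
        if previous_start is not None:
            # Assumed gap token: whole days between consecutive visit starts.
            tokens.append(f"D{(visit.start - previous_start).days}")
        tokens.append("VS")
        tokens.extend(visit.concepts)
        tokens.append("VE")
        previous_start = visit.start
    return tokens


def decode(tokens: list[str], first_start: date) -> list[Visit]:
    """Rebuild dated visits from a token sequence, given the first visit's start date."""
    visits: list[Visit] = []
    current_date = first_start
    current = None
    for token in tokens:
        if token == "VS":
            current = Visit(current_date, [])
        elif token == "VE":
            if current is not None:
                visits.append(current)
            current = None
        elif current is None and token.startswith("D") and token[1:].isdigit():
            # A gap token outside a visit advances the running date.
            current_date = current_date + timedelta(days=int(token[1:]))
        elif current is not None:
            current.concepts.append(token)
    return visits


timeline = [
    Visit(date(2020, 1, 1), ["concept_A", "concept_B"]),
    Visit(date(2020, 1, 11), ["concept_C"]),
]
sequence = encode(timeline)
restored = decode(sequence, date(2020, 1, 1))
```

Because the gap tokens carry exact day counts in this sketch, decoding recovers the original visit dates; a coarser time token (for example, one that buckets long gaps) would make the round trip only approximately lossless.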
Abstract
Synthetic Electronic Health Records (EHR) have emerged as a pivotal tool in advancing healthcare applications and machine learning models, particularly for researchers without direct access to healthcare data. Although existing methods, such as rule-based approaches and generative adversarial networks (GANs), generate synthetic data that resembles real-world EHR data, they often use a tabular format, which disregards temporal dependencies in patient histories and limits data replication. Recently, interest has grown in applying Generative Pre-trained Transformers (GPT) to EHR data, which enables applications such as disease progression analysis, population estimation, counterfactual reasoning, and synthetic data generation. In this work, we focus on synthetic data generation and demonstrate that a GPT model can be trained on a particular patient representation derived from CEHR-BERT, enabling us to generate patient sequences that can be seamlessly converted to the Observational Medical Outcomes Partnership (OMOP) data format.
