A Case Study Exploring the Current Landscape of Synthetic Medical Record Generation with Commercial LLMs
Yihan Lin, Zhirong Bella Yu, Simon Lee
TL;DR
This study assesses the current feasibility of generating synthetic Electronic Health Records with commercial Large Language Models, focusing on cross-hospital generalization and distribution fidelity. By comparing naive, schema-constrained, conditional, and group-based generation strategies on the eICU dataset, the authors show that LLMs can reliably mimic data for small feature subsets but struggle to preserve realistic joint distributions as dimensionality grows, limiting generalization across hospitals. A group-based conditioning approach improves fidelity and fairness across demographic groups, yet higher-dimensional setups (e.g., 83 features) undermine model performance and elevate privacy risks, as revealed by membership inference analyses. The work highlights a trade-off between feature richness and data quality, suggesting a sweet spot around 10 features with large sample sizes (10k) for practical synthetic data generation, and points to the need for advanced generative methods to scale high-dimensional healthcare data while maintaining privacy and utility. Overall, the findings inform design choices for synthetic EHR pipelines and emphasize careful consideration of dimensionality, group conditioning, and privacy safeguards in real-world deployments.
Abstract
Synthetic Electronic Health Records (EHRs) offer a valuable opportunity to create privacy-preserving and harmonized structured data, supporting numerous applications in healthcare. Key benefits of synthetic data include precise control over the data schema, improved fairness and representation of patient populations, and the ability to share datasets without compromising the privacy of real individuals. Consequently, the AI community has increasingly turned to Large Language Models (LLMs) to generate synthetic data across various domains. However, a significant challenge in healthcare is ensuring that synthetic health records reliably generalize across different hospitals, a long-standing issue in the field. In this work, we evaluate the current state of commercial LLMs for generating synthetic data and investigate multiple aspects of the generation process to identify where these models excel and where they fall short. Our main finding is that while LLMs can reliably generate synthetic health records for smaller subsets of features, they struggle to preserve realistic distributions and correlations as the dimensionality of the data increases, ultimately limiting their ability to generalize across diverse hospital settings.
