A Case Study Exploring the Current Landscape of Synthetic Medical Record Generation with Commercial LLMs

Yihan Lin, Zhirong Bella Yu, Simon Lee

TL;DR

This study assesses the current feasibility of generating synthetic Electronic Health Records with commercial Large Language Models, focusing on cross-hospital generalization and distribution fidelity. By comparing naive, schema-constrained, conditional, and group-based generation strategies on the eICU dataset, the authors show that LLMs can reliably mimic data for small feature subsets but struggle to preserve realistic joint distributions as dimensionality grows, limiting generalization across hospitals. A group-based conditioning approach improves fidelity and fairness across demographic groups, yet higher-dimensional setups (e.g., 83 features) undermine model performance and elevate privacy risks, as revealed by membership inference analyses. The work highlights a trade-off between feature richness and data quality, suggesting a sweet spot around 10 features with large sample sizes (10k) for practical synthetic data generation, and points to the need for advanced generative methods to scale high-dimensional healthcare data while maintaining privacy and utility. Overall, the findings inform design choices for synthetic EHR pipelines and emphasize careful consideration of dimensionality, group conditioning, and privacy safeguards in real-world deployments.

Abstract

Synthetic Electronic Health Records (EHRs) offer a valuable opportunity to create privacy-preserving, harmonized structured data, supporting numerous applications in healthcare. Key benefits of synthetic data include precise control over the data schema, improved fairness and representation of patient populations, and the ability to share datasets without compromising real individuals' privacy. Consequently, the AI community has increasingly turned to Large Language Models (LLMs) to generate synthetic data across various domains. However, a significant challenge in healthcare is ensuring that synthetic health records reliably generalize across different hospitals, a long-standing issue in the field. In this work, we evaluate the current state of commercial LLMs for generating synthetic data and investigate multiple aspects of the generation process to identify where these models excel and where they fall short. Our main finding is that while LLMs can reliably generate synthetic health records for smaller subsets of features, they struggle to preserve realistic distributions and correlations as the dimensionality of the data increases, ultimately limiting their ability to generalize across diverse hospital settings.
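
To make the distinction between preserving distributions and preserving correlations concrete, the following is a minimal sketch, not the authors' evaluation code. It assumes two hypothetical continuous features (age and heart rate) and toy data drawn from a made-up model; the "synthetic" table is built by shuffling each real column independently, so every marginal matches exactly while the joint structure is destroyed. The choice of the Kolmogorov-Smirnov statistic and Pearson correlation as fidelity measures is also an assumption made for illustration.

```python
import math
import random


def ks_statistic(real, synth):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    real_sorted, synth_sorted = sorted(real), sorted(synth)
    i = j = 0
    gap = 0.0
    for x in sorted(set(real_sorted) | set(synth_sorted)):
        # Advance each pointer past all values <= x to get the CDF counts at x.
        while i < len(real_sorted) and real_sorted[i] <= x:
            i += 1
        while j < len(synth_sorted) and synth_sorted[j] <= x:
            j += 1
        gap = max(gap, abs(i / len(real_sorted) - j / len(synth_sorted)))
    return gap


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


# Toy "real" data (hypothetical): heart rate depends linearly on age plus noise.
rng = random.Random(0)
real_age = [rng.gauss(60, 15) for _ in range(2000)]
real_hr = [80 + 0.5 * (a - 60) + rng.gauss(0, 5) for a in real_age]

# Toy "synthetic" data: each column shuffled independently, so the marginals
# are identical to the real ones but the age/heart-rate relationship is lost.
synth_age = real_age[:]
synth_hr = real_hr[:]
rng.shuffle(synth_age)
rng.shuffle(synth_hr)

marginal_gap_age = ks_statistic(real_age, synth_age)
marginal_gap_hr = ks_statistic(real_hr, synth_hr)
correlation_gap = abs(pearson(real_age, real_hr) - pearson(synth_age, synth_hr))

print("KS gap (age):", marginal_gap_age)
print("KS gap (heart rate):", marginal_gap_hr)
print("Correlation gap:", round(correlation_gap, 3))
```

In this toy setup the per-feature checks report a perfect match while the correlation gap is large, which is why a fidelity evaluation that looks only at marginals can miss the kind of degradation described above.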

Paper Structure

This paper contains 50 sections, 8 equations, 5 figures, 8 tables.

Figures (5)

  • Figure 1: Overview of our Method
  • Figure 2: Bar plot illustrating the Membership Inference Attack Results, comparing AUC, Membership Advantage, and Empirical Risk across different numbers of features.
  • Figure 3: Several bar plots repeating the experiments from the main paper and comparing other commercial LLMs: Gemini and Claude.
  • Figure 4: Comparison of feature distributions when an LLM is asked to generate 10 features versus all 83. Only continuous features are plotted, but the two settings differ substantially in synthetic data fidelity.
  • Figure 5: Feature Importance Plot to help with our feature selection study
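
Figure 2 reports AUC and Membership Advantage for a membership inference attack. As a rough illustration of how such metrics can be computed from attack scores, here is a minimal sketch. It is not the authors' implementation: the function names and toy scores are hypothetical, membership advantage is taken to be the maximum of TPR minus FPR over all thresholds (one common definition, which may differ from the paper's), and the "Empirical Risk" metric is omitted because this page does not define it.

```python
def roc_points(scores, is_member):
    """Sweep a threshold over attack scores and return (FPR, TPR) pairs.

    scores: higher means the attack believes the record was in the training data.
    is_member: 1/True for true members, 0/False for non-members.
    """
    pos = sum(is_member)
    neg = len(is_member) - pos
    ranked = sorted(zip(scores, is_member), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    k = 0
    while k < len(ranked):
        threshold = ranked[k][0]
        # Records with tied scores cross the threshold together.
        while k < len(ranked) and ranked[k][0] == threshold:
            if ranked[k][1]:
                tp += 1
            else:
                fp += 1
            k += 1
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


def membership_advantage(points):
    """Largest gap between TPR and FPR over all thresholds (assumed definition)."""
    return max(tpr - fpr for fpr, tpr in points)


# Hypothetical attack scores for six records, three of which were members.
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
is_member = [1, 1, 1, 0, 0, 0]

curve = roc_points(scores, is_member)
print("AUC:", auc(curve))
print("Membership advantage:", membership_advantage(curve))
```

An AUC near 0.5 and an advantage near 0 indicate the attack does no better than guessing; values approaching 1 indicate that members of the training data can be identified, which corresponds to the elevated privacy risk the summary attributes to higher-dimensional generation.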