From Statistical Fidelity to Clinical Consistency: Scalable Generation and Auditing of Synthetic Patient Trajectories

Guanglin Zhou, Armin Catic, Motahare Shabestari, Matthew Young, Chaiquan Li, Katrina Poppe, Sebastiano Barbieri

TL;DR

An integrated pipeline makes synthetic patient trajectories clinically consistent through two synergistic steps: high-fidelity generation, followed by scalable auditing with an automated module that uses large language models to filter out clinical inconsistencies that escape probabilistic generation.

Abstract

Access to electronic health records (EHRs) for digital health research is often limited by privacy regulations and institutional barriers. Synthetic EHRs have been proposed as a way to enable safe and sovereign data sharing; however, existing methods may produce records that capture overall statistical properties of real data but present inconsistencies across clinical processes and observations. We developed an integrated pipeline to make synthetic patient trajectories clinically consistent through two synergistic steps: high-fidelity generation and scalable auditing. Using the MIMIC-IV database, we trained a knowledge-grounded generative model that represents nearly 32,000 distinct clinical events, including demographics, laboratory measurements, medications, procedures, and diagnoses, while enforcing structural integrity. To support clinical consistency at scale, we incorporated an automated auditing module leveraging large language models to filter out clinical inconsistencies (e.g., contraindicated medications) that escape probabilistic generation. We generated 18,071 synthetic patient records derived from a source cohort of 180,712 real patients. While synthetic clinical event probabilities demonstrated robust agreement (mean bias effectively 0.00) and high correlation ($R^2 = 0.99$) with the real counterparts, review of a random sample of synthetic records (N=20) by three clinicians identified inconsistencies in 45–60% of them. Automated auditing reduced the difference between real and synthetic data (Cohen's effect size d between 0.59 and 1.60 before auditing, and between 0.18 and 0.67 after auditing). Downstream models trained on audited data matched or even exceeded real-data performance. We found no evidence of privacy risks, with membership inference performance indistinguishable from random guessing (F1-score=0.51).
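The two agreement measures quoted in the abstract, Bland-Altman mean bias and Cohen's d, can be illustrated with a minimal sketch. This is not the authors' evaluation code: the function names, the toy probabilities, and the choice of a pooled sample standard deviation for d are all assumptions made for illustration, and the abstract does not say which quantity the effect sizes were computed on.

```python
import math
import statistics


def bland_altman(real, synth):
    """Return (mean bias, lower limit, upper limit) for paired values.

    The 95% limits of agreement are bias +/- 1.96 * SD of the differences.
    """
    diffs = [s - r for r, s in zip(real, synth)]
    bias = statistics.fmean(diffs)
    sd = statistics.stdev(diffs)  # sample SD of the paired differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


def cohens_d(a, b):
    """Cohen's d using a pooled sample standard deviation (an assumption)."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    return (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(pooled_var)


# Made-up event probabilities for four clinical codes (illustrative only).
real_probs = [0.10, 0.20, 0.05, 0.40]
synth_probs = [0.11, 0.19, 0.06, 0.39]
bias, lower, upper = bland_altman(real_probs, synth_probs)
print(f"bias={bias:.4f}, limits=({lower:.4f}, {upper:.4f})")
```

A bias near zero with narrow limits means the synthetic probabilities neither over- nor under-estimate the real ones on average. As the abstract notes, that kind of agreement does not by itself rule out clinical inconsistencies within individual records.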

Paper Structure (24 sections, 4 equations, 7 figures, 6 tables)

Figures (7)

  • Figure 1: Fidelity and structural evaluation of synthetic EHR data. a, Vocabulary coverage across 31,901 concepts. While standard benchmarks (blue) capture high-frequency diagnoses, Coogee models the "long tail" of clinical care (red), including specific medications (e.g., Semaglutide) and laboratory tests (e.g., HbA1c) required for high-fidelity disease modeling. b-c, Structural characteristics comparing real (blue) and synthetic (red) cohorts. The synthetic data mirrors real-world distributions for visits per patient and codes per visit, with minor truncation in the extreme upper tail due to model context window limits. d-f, Probabilistic agreement assessed via density-scaled Bland-Altman plots. Minimal mean bias (0·00) and narrow 95% limits of agreement across unigrams (marginal code probabilities), same-visit co-occurrences, and sequential-visit dependencies confirm that the model reproduces phenotype prevalence and care pathways without systematic bias. g-i, Temporal fidelity evaluation using empirical cumulative distribution functions (ECDFs). Distributional alignment (KS statistics $\leq 0{\cdot}06$) across event gaps, length of stay, and inter-visit intervals confirms the model's ability to capture the irregular cadence of healthcare.
  • Figure 2: Preservation of complex clinical phenotypes and syndromic co-occurrences. Co-occurrence heatmaps of the top 150 diagnosis codes in Real (left) and Synthetic (right) cohorts, with axes sorted via hierarchical clustering. High Pearson ($r = 0{\cdot}93$) and Spearman ($\rho = 0{\cdot}88$) correlations demonstrate that Coogee generates coherent clinical syndromes rather than isolated codes. Annotated regions highlight the faithful reproduction of higher-order disease mechanisms, including the Cardio-Renal Syndrome and metabolic disease progression.
  • Figure S1: Generating high-fidelity, full-spectrum patient trajectories with Coogee. i, Schematic patient trajectory representation integrating multimodal information: demographics (age, sex, race, marital status, calendar year), laboratory tests, diagnoses (ICD-10-CM), procedures (ICD-10-PCS), medications (ATC codes), and structural tokens (encoding time gaps and delimiters for records and visits). ii, Knowledge-aware representation of clinical events: each code is linked to its ontology hierarchy (e.g., ICD structure) and mapped to textual descriptions. iii, Transformer-based generative architecture: tokenized medical events are embedded, enriched with knowledge-aware representations, and processed to model longitudinal and multimodal patient trajectories. iv, Example of a synthetic trajectory generated by Coogee: starting from demographics, the model produces diabetes-related codes including labs (glucose, creatinine), diagnoses (type 2 diabetes with CKD), medications (insulin, metformin), and associated procedures, with realistic temporal gaps between visits.
  • Figure S2: Scaling law investigation of model-data trade-offs in Coogee. (a) IsoFLOP curves showing validation loss as a function of model size (parameters) for different computational budgets. Each curve corresponds to a constant training FLOPs constraint, with dashed lines denoting quadratic fits in log-parameter space. (b) Optimal model size (minimizing validation loss per compute budget) follows a power-law relationship with total training FLOPs, consistent with Chinchilla-style scaling trends (Hoffmann et al., 2022). (c) Corresponding optimal token counts also obey a power law with compute. Together, these analyses quantify the balance between model capacity and dataset size for efficient training. Based on the identified isoFLOP frontier and available computational resources, we selected the 50M-parameter (13M trainable) configuration for our main experiments.
  • Figure S3: Preservation of population-level statistics: correlations of unigram (marginal), same-visit bigram (co-occurrence), and sequential-visit bigram (temporal) probabilities.
  • ...and 2 more figures
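
The KS statistics reported for panels g-i of Figure 1 measure the largest vertical gap between two empirical cumulative distribution functions. A minimal sketch of that computation follows. The function name and the length-of-stay values are hypothetical, and this is an illustration of the standard two-sample statistic, not the authors' evaluation code.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample KS statistic: the largest gap between the two ECDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    # Both ECDFs are step functions, so the maximum gap occurs at an observed value.
    for x in a_sorted + b_sorted:
        ecdf_a = bisect_right(a_sorted, x) / len(a_sorted)  # share of a <= x
        ecdf_b = bisect_right(b_sorted, x) / len(b_sorted)  # share of b <= x
        gap = max(gap, abs(ecdf_a - ecdf_b))
    return gap


# Made-up lengths of stay in days (illustrative only).
real_los = [1, 2, 2, 3, 5, 8]
synth_los = [1, 2, 3, 3, 5, 7]
ks = ks_statistic(real_los, synth_los)
print(f"KS statistic: {ks:.3f}")
```

The statistic ranges from 0 (identical empirical distributions) to 1 (no overlap), so values at or below 0·06 indicate that the real and synthetic ECDFs never differ by more than six percentage points.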