Privacy-Preserving Generative Modeling and Clinical Validation of Longitudinal Health Records for Chronic Disease

Benjamin D. Ballyk, Ankit Gupta, Sujay Konda, Kavitha Subramanian, Chris Landon, Ahmed Ammar Naseer, Georg Maierhofer, Sumanth Swaminathan, Vasudevan Venkateshwaran

TL;DR

The study tackles the privacy barrier to deploying ML on longitudinal EHR data by developing DP-TimeGAN, a differentially private extension of TimeGAN, and evaluating it alongside a non-private Augmented TimeGAN. The authors introduce discriminator noise injection and assess an xLSTM option, while enforcing privacy via gradient clipping and noise within a Rényi-DP accounting framework. Through statistical metrics, TSTR-based utility, and blinded clinician validation on sine, eICU, and CKD datasets, they show DP-TimeGAN achieves strong privacy guarantees with competitive clinical realism and downstream utility, particularly in CKD contexts. This work enables safer data sharing and robust ML testing for chronic disease modeling, advancing privacy-preserving synthetic EHR generation for real-world clinical workflows.
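
The gradient clipping and noise step mentioned above can be illustrated with a short sketch. This is a generic clip-then-add-Gaussian-noise recipe in the style of DP-SGD, not the authors' implementation: the function names and the `max_norm` and `noise_multiplier` parameters are hypothetical, the gradients are made-up toy values, and the privacy accounting that turns the noise level into an ($\varepsilon$, $\delta$) budget is not shown.

```python
import math
import random


def clip_gradient(grad, max_norm):
    """Scale one per-example gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def private_mean_gradient(per_example_grads, max_norm, noise_multiplier, rng):
    """Clip each per-example gradient, sum, add Gaussian noise, then average."""
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        for i, g in enumerate(clip_gradient(grad, max_norm)):
            total[i] += g
    # The noise standard deviation is tied to the clipping bound, which caps
    # how much any single example can change the summed gradient.
    sigma = noise_multiplier * max_norm
    noisy_total = [t + rng.gauss(0.0, sigma) for t in total]
    return [v / len(per_example_grads) for v in noisy_total]


# Toy per-example gradients (illustrative values only).
rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4], [-6.0, 8.0]]
noisy_mean = private_mean_gradient(grads, max_norm=1.0, noise_multiplier=1.1, rng=rng)
```

Clipping bounds each record's influence on the update, and the added noise masks whatever influence remains; a larger `noise_multiplier` gives stronger privacy at the cost of noisier updates.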

Abstract

Data privacy is a critical challenge in modern medical workflows as the adoption of electronic patient records has grown rapidly. Stringent data protection regulations limit access to clinical records for training and integrating machine learning models that have shown promise in improving diagnostic accuracy and personalized care outcomes. Synthetic data offers a promising alternative; however, current generative models either struggle with time-series data or lack formal privacy guarantees. In this paper, we enhance a state-of-the-art time-series generative model to better handle longitudinal clinical data while incorporating quantifiable privacy safeguards. Using real data from chronic kidney disease and ICU patients, we evaluate our method through statistical tests, a Train-on-Synthetic-Test-on-Real (TSTR) setup, and expert clinical review. Our non-private model (Augmented TimeGAN) outperforms transformer- and flow-based models on statistical metrics in several datasets, while our private model (DP-TimeGAN) maintains a mean authenticity of 0.778 on the CKD dataset, outperforming existing state-of-the-art models on the privacy-utility frontier. Both models achieve performance comparable to real data in clinician evaluations, providing robust input data necessary for developing models for complex chronic conditions without compromising data privacy.
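
The Train-on-Synthetic-Test-on-Real (TSTR) setup named in the abstract can be sketched as follows. This is a toy illustration of the protocol only, under loud assumptions: the two-class Gaussian data, the nearest-centroid classifier, and all function names are hypothetical stand-ins, not the datasets or downstream models used in the paper. The train-on-real baseline (often called TRTR) is included only as a common point of comparison.

```python
import random


def centroid(rows):
    """Feature-wise mean of a list of equal-length feature vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def fit_nearest_centroid(features, labels):
    """Return one centroid per class label."""
    by_label = {}
    for x, y in zip(features, labels):
        by_label.setdefault(y, []).append(x)
    return {y: centroid(rows) for y, rows in by_label.items()}


def predict(model, x):
    """Assign x to the class whose centroid is nearest (squared Euclidean)."""
    return min(model, key=lambda y: sum((a - b) ** 2 for a, b in zip(x, model[y])))


def accuracy(model, features, labels):
    """Fraction of examples whose predicted label matches the true label."""
    hits = sum(predict(model, x) == y for x, y in zip(features, labels))
    return hits / len(labels)


def toy_dataset(n, rng, shift):
    """Hypothetical two-class data: class 1 is offset from class 0 by `shift`."""
    features, labels = [], []
    for i in range(n):
        y = i % 2
        features.append([rng.gauss(y * shift, 1.0), rng.gauss(y * shift, 1.0)])
        labels.append(y)
    return features, labels


rng = random.Random(0)
real_train_x, real_train_y = toy_dataset(200, rng, shift=3.0)
real_test_x, real_test_y = toy_dataset(200, rng, shift=3.0)
# Stand-in for generator output: similar to, but not identical to, the real data.
synth_x, synth_y = toy_dataset(200, rng, shift=2.8)

# TSTR: fit on synthetic data only, evaluate on held-out real data.
tstr = accuracy(fit_nearest_centroid(synth_x, synth_y), real_test_x, real_test_y)
# Baseline: fit on real training data, evaluate on the same held-out real data.
trtr = accuracy(fit_nearest_centroid(real_train_x, real_train_y), real_test_x, real_test_y)
print(f"TSTR accuracy: {tstr:.3f}  TRTR baseline: {trtr:.3f}")
```

A TSTR score close to the real-data baseline suggests the synthetic data preserves the structure the downstream task depends on; a large gap suggests the generator has lost it.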

Paper Structure

This paper contains 29 sections, 13 equations, 9 figures, 10 tables.

Figures (9)

  • Figure 1: (A) Current workflow when handling protected patient data within the clinic. (B) Proposed downstream model pipeline for generic secure patient evaluation with machine learning models.
  • Figure 2: Model architecture for DP-TimeGAN. The model consists of five recurrent networks: embedding ($\mathcal{E}$), recovery ($\mathcal{R}$), supervisor ($\mathcal{S}$), generator ($G$), and discriminator ($D$). Real sequences $\mathbf{x}_{1:T}$ are mapped to latent space as $\mathbf{h}_{1:T} = \mathcal{E}(\mathbf{x}_{1:T})$. The generator produces latent sequences $\mathbf{\hat{e}}_{1:T} = G(\mathbf{z}_{1:T})$ from random noise, which are refined by the supervisor into supervised embeddings $\mathbf{\hat{h}}_{2:T+1} = \mathcal{S}(\mathbf{\hat{e}}_{1:T})$. The recovery network maps latent sequences back to data space, yielding $\mathbf{\tilde{x}}_{1:T} = \mathcal{R}(\mathbf{h}_{1:T})$, and the discriminator outputs $\hat{y}\in [0,1]$ as the classification of latent sequences for adversarial training.
  • Figure 3: Real and synthetic eGFR trajectories for patients with chronic kidney disease (CKD). CKD stages are shaded in order of severity, labeled on the right. Data has shape ($N$, $T$, $C$) = (421, 7, 7); (b), (c), and (d) use parameters: $\# \text{epochs}= 10000$, $\# \text{layers} = 3$, $\text{latent-dim} = 24$, $\gamma=1$. For DP, ($\varepsilon$, $\delta$) = (10, $10^{-5}$).
  • Figure 4: Comparison of real and synthetic sinusoidal data from the Augmented TimeGAN. Real data uses ($N$, $T$, $C$) = (700, 24, 5), where each feature is a randomly generated sine wave; training parameters are: $\# \text{epochs} = 6000$, $\# \text{layers} = 3$ and $\text{latent-dim} = 24$. All data is normalized to a starting value of 1 prior to plotting for clarity. Plots (a) and (b) isolate one feature of real and synthetic sine waves, respectively; plots (c) and (d) compare the PCA and t-SNE results, respectively, for the two datasets.
  • Figure 5: Sample test patient from blinded clinician evaluation.
  • ...and 4 more figures