Generative clinical time series models trained on moderate amounts of patient data are privacy preserving
Rustam Zhumagambetov, Niklas Giesa, Sebastian D. Boie, Stefan Haufe
TL;DR
This study addresses privacy risks in generative clinical time-series models trained on public ICU data. It evaluates four state-of-the-art generative models (GHOSTS, HALO, KoVAE, Diffusion-TS) against a battery of membership inference attacks using MIMIC-IV and eICU data, showing that with training sizes above ~500 samples the synthetic outputs are largely resistant to such attacks. The work also demonstrates that differential privacy, while theoretically protective, can reduce utility and does not reliably enhance privacy in this context, and that cross-dataset attacks can exploit shared physiological patterns to undermine privacy. The authors propose a framework for ex-post privacy auditing that quantifies privacy risk via multiple attack modalities and metrics, and they advocate integrating such audits into model validation and data governance workflows to enable safer data sharing for research.
Abstract
Sharing medical data to train machine learning models is often impossible because of the risk of disclosing identifying information about individual patients. Synthetic data produced by generative artificial intelligence (genAI) models trained on real data is often seen as one possible way to comply with privacy regulations. While powerful genAI models for heterogeneous hospital time series have recently been introduced, such modeling does not guarantee privacy protection, as the generated data may still reveal identifying information about individuals in the models' training cohort. Applying established privacy mechanisms to generative time series models, however, proves challenging: post-hoc data anonymization through k-anonymization or similar techniques is limited, while model-centered privacy mechanisms that implement differential privacy (DP) may lead to unstable training, compromising the utility of the generated data. Given these known limitations, privacy audits for generative time series models are currently indispensable, regardless of the concrete privacy mechanisms applied to models and/or data. In this work, we use a battery of established privacy attacks to audit state-of-the-art hospital time series models, trained on the public MIMIC-IV dataset, with respect to privacy preservation. Furthermore, we use the eICU dataset to mount a privacy attack against the synthetic data generator trained on the MIMIC-IV dataset. Results show that established privacy attacks are ineffective against generated multivariate clinical time series when synthetic data generators are trained on sufficiently large training datasets. Finally, we discuss how applying existing DP mechanisms to these synthetic data generators would not bring the desired improvement in privacy, but would only decrease utility for machine learning prediction tasks.
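To make the idea of an ex-post privacy audit concrete, the following is a minimal sketch of one generic, distance-based membership inference attack on synthetic time series. It is an illustration only and is not claimed to be one of the attacks used in this work. The function names (`min_distance_to_synthetic`, `attack_auc`, `audit`), the choice of Euclidean distance on flattened series, and the random toy data are all assumptions made for the example. The attack scores each candidate record by its distance to the nearest synthetic record; an AUC near 0.5 means the attacker cannot tell training members from held-out patients, while an AUC near 1.0 indicates leakage.

```python
import numpy as np


def min_distance_to_synthetic(candidates, synthetic):
    """Euclidean distance from each candidate series to its nearest synthetic series.

    Both inputs are arrays of shape (n_series, n_timesteps, n_features);
    each series is flattened into a single vector (an assumption of this sketch).
    """
    c = candidates.reshape(len(candidates), -1)
    s = synthetic.reshape(len(synthetic), -1)
    # Pairwise squared distances via ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2
    d2 = (c ** 2).sum(axis=1)[:, None] - 2.0 * c @ s.T + (s ** 2).sum(axis=1)[None, :]
    return np.sqrt(np.maximum(d2, 0.0)).min(axis=1)


def attack_auc(member_scores, nonmember_scores):
    """Probability that a random member outscores a random non-member (ties count half)."""
    m = np.asarray(member_scores)[:, None]
    n = np.asarray(nonmember_scores)[None, :]
    return float((m > n).mean() + 0.5 * (m == n).mean())


def audit(train, holdout, synthetic):
    """Score = negative nearest-synthetic distance; closer records look more like members."""
    member_scores = -min_distance_to_synthetic(train, synthetic)
    nonmember_scores = -min_distance_to_synthetic(holdout, synthetic)
    return attack_auc(member_scores, nonmember_scores)


# Toy data (hypothetical): 200 "patients", 24 time steps, 5 variables.
rng = np.random.default_rng(0)
train = rng.normal(size=(200, 24, 5))
holdout = rng.normal(size=(200, 24, 5))

# A deliberately leaky "generator" that copies training records with tiny noise,
# and a non-leaky one that draws fresh samples from the same distribution.
leaky_synthetic = train + 0.01 * rng.normal(size=train.shape)
fresh_synthetic = rng.normal(size=(200, 24, 5))

print("AUC, leaky generator:", audit(train, holdout, leaky_synthetic))
print("AUC, fresh generator:", audit(train, holdout, fresh_synthetic))
```

In a real audit, `train` and `holdout` would be disjoint sets of real patient records and `synthetic` would be drawn from the trained generator; the held-out set could also come from a different cohort, in the spirit of the cross-dataset setting described above.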
