Privacy-Preserving Statistical Data Generation: Application to Sepsis Detection

Eric Macias-Fassio, Aythami Morales, Cristina Pruenza, Julian Fierrez

TL;DR

This work tackles privacy constraints in biomedical AI by introducing KDE-KNN, a method based on Kernel Density Estimation (KDE) and K-Nearest Neighbors (KNN) sampling that generates privacy-preserving synthetic tabular data for sepsis detection. The authors compare KDE-KNN against SMOTE and TabDDPM on two real ICU/ED datasets (MaDB and SLDB) using several ML models (Random Forest, and SVM with linear and RBF kernels), with external validation showing robust generalization. Their results show that synthetic data, especially when balanced via KDE-KNN, can improve predictive performance (AUC) while maintaining privacy, as evidenced by higher Distance to Closest Record (DCR) values than those of real data. The study demonstrates practical implications for compliant data sharing and robust model training in regulated biomedical settings, highlighting KDE-KNN as a viable tool for privacy-preserving data augmentation in sepsis detection. AUC values reach up to $0.7682$ on the external cohort, underscoring the method’s potential for real-world deployment.

Abstract

The biomedical field is among the sectors most impacted by the increasing regulation of Artificial Intelligence (AI) and data protection legislation, given the sensitivity of patient information. However, the rise of synthetic data generation methods offers a promising opportunity for data-driven technologies. In this study, we propose a statistical approach for synthetic data generation applicable in classification problems. We assess the utility and privacy implications of synthetic data generated by Kernel Density Estimator and K-Nearest Neighbors sampling (KDE-KNN) within a real-world context, specifically focusing on its application in sepsis detection. The detection of sepsis is a critical challenge in clinical practice due to its rapid progression and potentially life-threatening consequences. Moreover, we emphasize the benefits of KDE-KNN compared to current synthetic data generation methodologies. Additionally, our study examines the effects of incorporating synthetic data into model training procedures. This investigation provides valuable insights into the effectiveness of synthetic data generation techniques in mitigating regulatory constraints within the biomedical field.

Paper Structure (15 sections, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Block diagram of the KDE-KNN synthetic data generation method, including the generation modules based on two Kernel Density Estimators (Sepsis and Non-Sepsis) and KNN sampling.
  • Figure 2: Distribution of four features from the two real datasets and the synthetic dataset. The solid black line shows the distribution from SLDB, the grey line the distribution from MaDB, and the dashed line the distribution of synthetic data generated by KDE-KNN. All features were z-score normalized.
  • Figure 3: Compromise between privacy and realism of synthetic samples. The graphs represent the distance between real and synthetic samples in a conceptual 2-dimensional space.
  • Figure 4: Probability distribution of the Distance to Closest Record (DCR) for real samples and for synthetic samples generated with the three generation approaches evaluated in our experiments.
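
The Distance to Closest Record (DCR) named in Figures 3 and 4 can be illustrated with a minimal sketch. The text here does not state which distance metric is used or how the DCR of real samples is defined, so the Euclidean distance and the "nearest other real record" baseline below are assumptions, as are the function name `distance_to_closest_record` and the toy points.

```python
import math

def distance_to_closest_record(queries, reference, exclude_self=False):
    """For each query row, return the Euclidean distance to its nearest reference row.

    Assumption: Euclidean distance on already-normalized features.
    With exclude_self=True, row i of `queries` is not compared with row i of
    `reference`; use it when both arguments are the same real dataset.
    """
    dcr = []
    for i, q in enumerate(queries):
        candidates = [
            math.dist(q, r)
            for j, r in enumerate(reference)
            if not (exclude_self and i == j)
        ]
        dcr.append(min(candidates))
    return dcr

# Toy, made-up 2-D points (not taken from MaDB or SLDB).
real = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
synthetic = [(0.0, 1.0), (3.0, 4.0)]

# DCR of each synthetic sample with respect to the real records.
dcr_synthetic = distance_to_closest_record(synthetic, real)

# Baseline: DCR of each real record with respect to the other real records.
dcr_real = distance_to_closest_record(real, real, exclude_self=True)

print(dcr_synthetic)
print(dcr_real)
```

In this toy case the second synthetic point has a DCR of zero because it duplicates a real record, which is the kind of sample a privacy-oriented generator should avoid. Comparing the two distributions, as Figure 4 does, shows whether synthetic samples sit closer to real records than real records sit to each other.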