EHRDiff: Exploring Realistic EHR Synthesis with Diffusion Models

Hongyi Yuan, Songchi Zhou, Sheng Yu

TL;DR

The paper tackles the challenge of limited publicly available EHR data due to privacy concerns by introducing EHRDiff, a diffusion-model-based framework for unconditional EHR synthesis. By formulating EHR generation through forward and reverse diffusion processes and a pre-conditioned denoiser, EHRDiff achieves high-fidelity synthetic data while balancing privacy risks. Across MIMIC-III and additional CinC2012/PTB-ECG datasets, it attains state-of-the-art utility metrics and competitive privacy performance, outperforming GAN-based baselines in diversity and feature correlations and enabling strong downstream predictive performance. This work demonstrates a practical path for generating realistic, privacy-preserving EHR data to accelerate biomedical methodology development and evaluation.

Abstract

Electronic health records (EHR) contain a wealth of biomedical information and serve as valuable resources for developing precision medicine systems. However, privacy concerns have limited researchers' access to high-quality, large-scale EHR data, impeding methodological progress. Recent research has explored synthesizing realistic EHR data with generative modeling techniques, and most proposed methods rely on generative adversarial networks (GAN) and their variants. Although GAN-based methods attain state-of-the-art performance in generating EHR data, they are difficult to train and prone to mode collapse. Diffusion models, a more recent class of generative models, have established cutting-edge performance in image generation, but their efficacy in EHR data synthesis remains largely unexplored. In this study, we investigate the potential of diffusion models for EHR data synthesis and introduce a novel method, EHRDiff. Through extensive experiments, EHRDiff establishes new state-of-the-art quality for synthetic EHR data while protecting private information.

Paper Structure

This paper contains 41 sections, 10 equations, 4 figures, 5 tables, and 1 algorithm.

Figures (4)

  • Figure 1: The dimension-wise probability scatter plot of synthetic EHR data from different generative models against real EHR data. The diagonal lines represent the perfect match of code prevalence between synthetic and real EHR data.
  • Figure 2: The dimension-wise prediction scatter plot of synthetic EHR data from different generative models against real EHR data. The diagonal lines represent the perfect match of code prediction between synthetic and real EHR data. Each point represents one prediction task.
  • Figure 3: The line plots for CinC2012 and PTB-ECG with different data scales. The green star represents the performance of the model trained on real data.
  • Figure 4: Histograms of the empirical distributions of unique code counts per sample. The solid lines are kernel density estimates of the distributions.
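
The quantities behind Figures 1 and 4 can be illustrated with a short sketch. It assumes each EHR record is a binary vector in which entry j marks whether code j is present; that representation, the function names, and the toy data are assumptions for illustration and are not taken from the paper.

```python
# Minimal sketch, assuming records are binary code-indicator vectors.
# All names and the toy data below are hypothetical.

def code_prevalence(records):
    """Fraction of records in which each code (column) is present."""
    n = len(records)
    dims = len(records[0])
    return [sum(r[j] for r in records) / n for j in range(dims)]


def unique_code_counts(records):
    """Number of distinct codes present in each record (row)."""
    return [sum(1 for v in r if v) for r in records]


def mean_abs_gap(real, synth):
    """Mean absolute difference between real and synthetic prevalence.

    A value of 0 means every point would lie on the diagonal of a
    dimension-wise probability scatter plot.
    """
    p_real = code_prevalence(real)
    p_syn = code_prevalence(synth)
    return sum(abs(a - b) for a, b in zip(p_real, p_syn)) / len(p_real)


# Toy data: 4 records, 4 codes each.
real = [[1, 0, 1, 0], [1, 1, 0, 0], [0, 0, 1, 0], [1, 0, 0, 0]]
synthetic = [[1, 0, 1, 0], [1, 0, 0, 0], [0, 1, 1, 0], [1, 0, 0, 1]]

print(code_prevalence(real))        # x-coordinates of the scatter points
print(code_prevalence(synthetic))   # y-coordinates of the scatter points
print(unique_code_counts(real))     # values that would feed a histogram
print(mean_abs_gap(real, synthetic))
```

Plotting the two prevalence lists against each other gives one point per code, and the per-record counts are the values a histogram like the one in Figure 4 would summarize.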