Unconditional Latent Diffusion Models Memorize Patient Imaging Data: Implications for Openly Sharing Synthetic Data

Salman Ul Hassan Dar; Marvin Seyfarth; Isabelle Ayx; Theano Papavassiliu; Stefan O. Schoenberg; Robert Malte Siepmann; Fabian Christopher Laqua; Jannik Kahmann; Norbert Frey; Bettina Baeßler; Sebastian Foersch; Daniel Truhn; Jakob Nikolas Kather; Sandy Engelhardt

TL;DR

The study assesses the risk that unconditional latent diffusion models (LDMs) trained on private medical images memorize patient data, and the implications for openly sharing synthetic data. It introduces a self-supervised copy-detection approach to quantify the number of memorized training samples ($N_{mem}$) and the number of copies among synthetic outputs ($N_{copies}$) across 3D MRI/CT and 2D X-ray datasets, comparing LDMs with non-diffusion baselines. Results show substantial memorization in LDMs, particularly in 3D data, yet LDMs generally yield superior synthesis quality; memorization can be mitigated by data augmentation and smaller model sizes, though this may come at the cost of realism. The work highlights privacy risks in synthetic data sharing and advocates memorization-aware training and screening of synthetic data before dissemination to protect patient privacy in medical research workflows.

Abstract

AI models have a wide range of applications in medicine. However, achieving optimal performance requires access to extensive healthcare data, which is often not readily available. Furthermore, the imperative to preserve patient privacy restricts patient data sharing with third parties and even within institutes. Recently, generative AI models have been gaining traction for facilitating open-data sharing by proposing synthetic data as surrogates of real patient data. Despite the promise, some of these models are susceptible to patient data memorization, where models generate patient data copies instead of novel synthetic samples. Despite its importance, this problem has received surprisingly little attention in the medical imaging community. To address this gap, we assess memorization in unconditional latent diffusion models. We train latent diffusion models on CT, MR, and X-ray datasets for synthetic data generation. We then detect the amount of training data memorized using our novel self-supervised copy detection approach and further investigate various factors that can influence memorization. Our findings show a surprisingly high degree of patient data memorization across all datasets. Comparison with non-diffusion generative models, such as autoencoders and generative adversarial networks, indicates that while latent diffusion models are more susceptible to memorization, overall they outperform non-diffusion models in synthesis quality. Further analyses reveal that augmentation strategies, smaller architectures, and larger datasets can reduce memorization, while over-training the models can increase it. Collectively, our results emphasize the importance of carefully training generative models on private medical imaging datasets, and examining the synthetic data to ensure patient privacy before sharing it for medical research and applications.

Paper Structure (54 sections, 1 equation, 32 figures, 8 tables, 1 algorithm)

Introduction
Prevalence:
Comparison to other Generative Models:
Accurate Detection:
Robust Detection:
Quality of the Detected Copies:
Impact of Training Data Size:
Memorization as a Metric:
Comparison with Traditional Metrics:
Mitigation via Data Augmentation:
Impact of Model Size:
Results
Experimental Settings
Datasets
Generative Models
...and 39 more sections

Figures (32)

Figure 1: Generative models are first trained on private medical data. These models can be used to synthesize novel samples, which can have multiple applications. 1) Open-data sharing: Synthesized samples can be shared publicly for advancing medical imaging research while preserving patient privacy. However, synthesized samples can be patient data replicas, thereby compromising patient privacy. 2) Data expansion and diversification: Synthetic samples can be utilized to expand and diversify the training data. Nevertheless, if most of the synthetic samples are patient data replicas, the expansion and diversification are likely to be limited.
Figure 2: Histograms showing distributions of Pearson's correlation values among closest training-validation pairs and training-synthetic pairs in a) PCCTA, b) MRNet, c) fastMRI and d) X-ray datasets. All training, validation, and synthetic samples were projected into an embedding space using self-supervised models. For each training embedding, the closest embedding was selected from the validation data, denoted as 'Validation', and from the synthetic data of each generative model, denoted as 'MedDiff', 'MONAI', 'MONAI-2D', 'CCE-GAN', 'proj-GAN' and 'VQVAE-Trans'. Afterwards, $\tau$ was selected based on the 95th percentile of the correlation values in 'Validation' in each dataset, and synthetic samples with correlation values greater than $\tau$ were classified as copies.
Figure 3: The left column represents private training data, and the right column represents synthesized data. a) Number of memorized training samples ($N_{mem}$) and b) number of synthesized samples that are patient data copies ($N_{copies}$) in PCCTA, MRNet, fastMRI and X-ray datasets as detected by our copy detection pipeline (Section \ref{methods:copy_detection}). All datasets show a high percentage of $N_{mem}$ and $N_{copies}$, notably in 3D datasets.
Figure 4: Representative cross sections of real (Real) and copies (MedDiff, MONAI) detected in the PCCTA dataset. Copies show a high resemblance to the corresponding real samples across all slices.
Figure 5: Representative cross sections of real (Real) and copies (MedDiff, MONAI) detected in the MRNet dataset. Copies show a high resemblance to the corresponding real samples.
...and 27 more figures
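
The thresholding step described in the Figure 2 caption can be illustrated with a minimal sketch. It assumes the embeddings have already been computed by some self-supervised model and are given as NumPy arrays with one row per sample. The function names `pearson_matrix` and `detect_copies` are hypothetical, and the rule used here for counting $N_{copies}$ (a synthetic sample counts as a copy if its correlation with any training sample exceeds $\tau$) is an assumption for illustration, not necessarily the paper's exact procedure.

```python
import numpy as np


def pearson_matrix(a, b):
    """Pearson correlation between every row of `a` and every row of `b`.

    Assumes no row is constant (a zero-variance row would divide by zero).
    """
    a = a - a.mean(axis=1, keepdims=True)
    b = b - b.mean(axis=1, keepdims=True)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T


def detect_copies(train_emb, val_emb, syn_emb, percentile=95.0):
    """Return (tau, n_mem, n_copies) for precomputed embeddings.

    tau is the given percentile of the correlations between each training
    embedding and its closest validation embedding.
    """
    # Correlation of each training sample with its closest validation sample.
    val_best = pearson_matrix(train_emb, val_emb).max(axis=1)
    tau = float(np.percentile(val_best, percentile))

    # Correlations between training samples (rows) and synthetic samples (columns).
    corr_syn = pearson_matrix(train_emb, syn_emb)

    # A training sample is treated as memorized if its closest synthetic
    # sample exceeds tau.
    n_mem = int((corr_syn.max(axis=1) > tau).sum())

    # Assumption: a synthetic sample is treated as a copy if it exceeds tau
    # with at least one training sample.
    n_copies = int((corr_syn > tau).any(axis=0).sum())
    return tau, n_mem, n_copies


# Toy usage with random stand-in embeddings (not real data).
rng = np.random.default_rng(0)
train = rng.normal(size=(20, 64))
val = rng.normal(size=(20, 64))
# Three near-duplicates of training samples plus seven unrelated samples.
syn = np.vstack([train[:3] + 1e-3 * rng.normal(size=(3, 64)),
                 rng.normal(size=(7, 64))])
tau, n_mem, n_copies = detect_copies(train, val, syn)
print(f"tau={tau:.3f}, N_mem={n_mem}, N_copies={n_copies}")
```

Deriving $\tau$ from the validation set means the threshold reflects how similar two genuinely different patients can be in the embedding space, so only synthetic samples that are closer to a training sample than almost any held-out real sample get flagged.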
