The Data Sharing Paradox of Synthetic Data in Healthcare
Jim Achterberg, Bram van Dijk, Saif ul Islam, Hafiz Muhammad Waseem, Parisis Gallos, Gregory Epiphaniou, Carsten Maple, Marcel Haas, Marco Spruit
TL;DR
This paper analyzes why synthetic data in healthcare, though designed to enable sharing, remains hindered by privacy metrics that are misaligned with regulatory requirements. It classifies privacy metrics into attribute disclosure and membership disclosure, examining how each relates to reidentification risk in synthetic data pipelines. It discusses challenges such as zero-risk versus acceptable-risk legal standards, the absence of universal thresholds, and healthcare-specific data complexities, along with ethical considerations. The authors propose context-aware privacy metrics, explainable assessments, and cross-disciplinary knowledge exchange as practical steps to reconcile privacy guarantees with data-sharing needs, aiming to accelerate safe adoption of synthetic data in healthcare.
Abstract
Synthetic data offers a promising solution to privacy concerns in healthcare by generating useful datasets in a privacy-aware manner. However, although synthetic data is typically developed with the intention of being shared, ambiguous reidentification risk assessments often prevent it from seeing the light of day. One of the main causes is that privacy metrics for synthetic data, which indicate reidentification risk, are not well aligned with practical requirements and regulations regarding data sharing in healthcare. This article discusses the paradoxical situation in which synthetic data is designed for data sharing but often remains restricted. We also discuss how the field should move forward to mitigate this issue.
