Paradox of De-identification: A Critique of HIPAA Safe Harbour in the Age of LLMs
Lavender Y. Jiang, Xujin Chris Liu, Kyunghyun Cho, Eric K. Oermann
TL;DR
This work argues that HIPAA Safe Harbor is increasingly insufficient for protecting patient privacy in the age of large language models. It formalizes privacy leakage with a causal-graph model linking identity, protected attributes, medical information, non-sensitive attributes, and observed notes, and demonstrates that de-identified notes retain nontrivial re-identification risk. Through an end-to-end re-identification attack on a large hospital notes dataset, the authors show that both attribute inference and patient linkage exceed random baselines, with risks that are measurable but scale-dependent. The study advocates risk-aware governance (tiered access, transparency, and utility checks) alongside research directions that go beyond engineering fixes, including differential privacy, synthetic data, and clearer data provenance. The findings highlight the practical and ethical urgency of rethinking data sharing in healthcare to preserve patient trust while enabling AI-driven advances.
Abstract
Privacy is a human right that sustains patient-provider trust. Clinical notes, which are used for care coordination and research, capture a patient's private vulnerability and individuality. Under HIPAA Safe Harbor, these notes are de-identified to protect patient privacy. However, Safe Harbor was designed for an era of categorical tabular data: it focuses on removing explicit identifiers and ignores the latent information in correlations between identity and quasi-identifiers, which modern LLMs can capture. We first formalize these correlations with a causal graph, then validate the formalization empirically by re-identifying individual patients from scrubbed notes. The paradox of de-identification is further shown through a diagnosis ablation: even when all other information is removed, the model can predict the patient's neighborhood from diagnosis alone. This position paper asks how we can act as a community to uphold patient-provider trust when de-identification is inherently imperfect. We aim to raise awareness and discuss actionable recommendations.
