Paradox of De-identification: A Critique of HIPAA Safe Harbour in the Age of LLMs

Lavender Y. Jiang; Xujin Chris Liu; Kyunghyun Cho; Eric K. Oermann

TL;DR

This work argues that HIPAA Safe Harbor is increasingly insufficient for protecting patient privacy in the age of large language models. It formalizes privacy leakage with a causal-graph model linking identity, protected attributes, medical information, non-sensitive attributes, and observed notes, and demonstrates that de-identified notes retain nontrivial re-identification risk. Through an end-to-end re-identification attack on a large hospital notes dataset, the authors show that attribute inference and patient linkage exceed random baselines, with measurable yet scale-dependent risks. The study advocates risk-aware governance—tiered access, transparency, and utility checks—alongside research directions that go beyond engineering fixes, including differential privacy, synthetic data, and clearer data provenance. The findings highlight the practical and ethical urgency of rethinking data sharing in healthcare to preserve patient trust while enabling AI-driven advances.

Abstract

Privacy is a human right that sustains patient-provider trust. Clinical notes capture a patient's private vulnerability and individuality, which are used for care coordination and research. Under HIPAA Safe Harbor, these notes are de-identified to protect patient privacy. However, Safe Harbor was designed for an era of categorical tabular data, focusing on the removal of explicit identifiers while ignoring the latent information found in correlations between identity and quasi-identifiers, which can be captured by modern LLMs. We first formalize these correlations using a causal graph, then validate it empirically through individual re-identification of patients from scrubbed notes. The paradox of de-identification is further shown through a diagnosis ablation: even when all other information is removed, the model can predict the patient's neighborhood based on diagnosis alone. This position paper raises the question of how we can act as a community to uphold patient-provider trust when de-identification is inherently imperfect. We aim to raise awareness and discuss actionable recommendations.

Paper Structure (38 sections, 10 figures, 2 tables)

Introduction
Related Work
Linkage Attacks and Regulatory Origins
Critiques of De-Identification
Technical Defenses and Legal Frontiers
The paradox from causal graphs
Note Generation
De-identification
Examples
The Paradox
Empirical Evidence from Re-identification
Experiment Setup
Data.
Finetuning
Prediction Evaluation.
...and 23 more sections

Figures (10)

Figure 1: Linkage attacks on text are more challenging than on tabular data, yet possible with modern language models.
Figure 2: The current de-identification framework (purple) leaves two backdoors open for linkage attacks via non-sensitive (blue) and medical (yellow) information.
Figure 3: Re-identification process overview. As a simplified example, a single de-identified note is passed through an LM for DOB and ZIP prediction. The predictions are used to match patients in a candidate database, resulting in a candidate set $\Omega$. In this case, the re-identification risk is zero if either prediction is wrong. Otherwise, the risk is $1/|\Omega|$, based on a random draw from the correctly identified group.
Figure 4: Attribute prediction exceeds random chance using as few as 1k examples. The LM's accuracy (red bars) is better than random guessing (grey bars) and improves with more data. AUCs show a similar above-random pattern whose gap increases with more data (Appendix \ref{apd:auc_bar}).
Figure 5: Individual re-identification risk significantly exceeds random guessing. The scatter plot shows the relationship between the group re-identification probability ($\mathbb{P}_G$, $x$ axis) and the probability of identifying the correct individual within that group ($\mathbb{P}_{I\mid G}\approx 1 / |\Omega|$, $y$ axis). More yellow points toward the upper right represent higher overall re-identification risk. LM predictions (circles) consistently outperform majority-class guesses (crosses).
...and 5 more figures
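
The matching step described in the Figure 3 caption can be illustrated with a minimal sketch. It assumes exact matching on two predicted attributes (DOB and ZIP) against a small in-memory candidate database. The names `Patient`, `candidate_set`, and `reidentification_risk` are hypothetical and chosen for illustration; they are not taken from the paper's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Patient:
    """Hypothetical record in the candidate database."""
    patient_id: str
    dob: str
    zip_code: str


def candidate_set(db, pred_dob, pred_zip):
    """Return Omega: every patient whose DOB and ZIP match the predictions."""
    return [p for p in db if p.dob == pred_dob and p.zip_code == pred_zip]


def reidentification_risk(db, true_patient, pred_dob, pred_zip):
    """Risk for one note under the simplified rule in the Figure 3 caption.

    If either prediction is wrong, the true patient is not in Omega and the
    risk is zero. Otherwise the attacker draws uniformly at random from Omega,
    so the risk is 1 / |Omega|.
    """
    omega = candidate_set(db, pred_dob, pred_zip)
    if true_patient not in omega:
        return 0.0
    return 1.0 / len(omega)
```

Under these assumptions, a correct prediction of both attributes for a patient who shares DOB and ZIP with one other person gives a risk of 1/2, while a single wrong attribute gives zero.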
