Medical Manifestation-Aware De-Identification

Yuan Tian, Shuo Wang, Guangtao Zhai

TL;DR

This work addresses privacy risks in medical facial imaging with two contributions: MeMa, a large-scale synthetic dataset of over 40k patient faces with rich medical manifestations, generated from real patient data; and MedSem-DeID, a baseline de-identification method that preserves medical semantics. MeMa supports learning medical priors through a diffusion-based generator and a medical semantics encoder, so that DeID can preserve diagnosis-relevant cues while protecting identity and remaining reversible. The approach yields superior medical utility (classification ~86.7%, segmentation Dice ~0.6775) and strong real-world clinical consistency (Cohen's kappa >0.81), while reducing identity leakage and preserving eye-related clinical signals. The release of MeMa and its MeMa-Seg subset provides a benchmark for medical-scene DeID, with practical impact on privacy-preserving medical AI applications and audits. Future work will broaden the scope beyond eye diseases.
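For readers unfamiliar with the two agreement metrics quoted above, the following is a minimal sketch of their standard textbook definitions. It is not the authors' evaluation code: the function names `dice_coefficient` and `cohens_kappa` are hypothetical, the masks are assumed to be flat binary sequences, and the summary does not state exactly which label pairs the reported kappa compares.

```python
from collections import Counter


def dice_coefficient(pred, target):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two flat binary masks (0/1 values)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumption: two empty masks are treated as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa = (p_o - p_e) / (1 - p_e) for two raters' categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items on which both raters agree.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # so this sketch returns 1.0 by convention (an assumption).
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    # Toy inputs for illustration only; these are not data from the paper.
    print(dice_coefficient([1, 1, 0, 0], [1, 0, 0, 0]))
    print(cohens_kappa(["d", "d", "h", "h"], ["d", "d", "h", "d"]))
```

Under the commonly used Landis and Koch interpretation, kappa values above 0.81 are read as almost perfect agreement, which is consistent with the "strong clinical consistency" wording above.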

Abstract

Face de-identification (DeID) has been widely studied for common scenes, but remains under-researched for medical scenes, mostly due to the lack of large-scale patient face datasets. In this paper, we release MeMa, which consists of over 40,000 photo-realistic patient faces. MeMa is re-generated from a large collection of real patient photos. By carefully modulating the generation and data-filtering procedures, MeMa avoids breaching real patient privacy while ensuring rich and plausible medical manifestations. We recruit expert clinicians to annotate MeMa with both coarse- and fine-grained labels, building the first medical-scene DeID benchmark. Additionally, we propose a baseline approach for this new medical-aware DeID task by integrating data-driven medical semantic priors into the DeID procedure. Despite its simplicity, our approach substantially outperforms previous ones. The dataset is available at https://github.com/tianyuan168326/MeMa-Pytorch.

Paper Structure

This paper contains 10 sections, 8 figures, 10 tables.

Figures (8)

  • Figure 1: (a) Common DeID approaches focus on removing identity. (b) Our medical-aware DeID (Med-DeID) also preserves the medical information necessary for diagnosis. (c) Our MeMa, a large-scale patient face dataset. (d) Our MeMa-Seg, the tumor segmentation subset of MeMa.
  • Figure 2: Examples and the distribution characteristics of the proposed MeMa dataset.
  • Figure 3: MeMa building pipeline. (a) Training the patient face generation model on real patient data. (b) Rich-condition patient face sampling. $P({age})$ and $P({gender})$ denote the age and gender distributions, which are statistically derived from the real patients. 'SD' denotes the Stable Diffusion model.
  • Figure 4: Comparison of different image generation strategies, using Basal Cell Carcinoma (BCC) as an example. 'SD' denotes Stable Diffusion.
  • Figure 5: Overview of the proposed baseline model MedSem-DeID. The snow icon indicates that $Enc_{\text{med}}$ is frozen while the DeID networks are trained. The image decoder after the ID-decryptor is omitted for brevity. $\oplus$ denotes the channel-wise concatenation operation.
  • ...and 3 more figures