Medical Manifestation-Aware De-Identification
Yuan Tian, Shuo Wang, Guangtao Zhai
TL;DR
This work addresses privacy risks in medical facial imaging by introducing MeMa, a large-scale synthetic dataset of over 40,000 patient faces with rich medical manifestations, re-generated from real patient data, and MedSem-DeID, a baseline de-identification (DeID) method that preserves medical semantics. MeMa supports learning medical priors through a diffusion-based generator and a medical semantics encoder, so that DeID retains diagnosis-relevant cues while protecting identity and remaining reversible. The approach yields superior medical utility (classification ~86.7%, segmentation Dice ~0.6775) and strong real-world clinical consistency (Cohen's kappa > 0.81), while reducing identity leakage and preserving eye-related clinical signals. The release of MeMa and its MeMa-Seg subset provides a benchmark for medical-scene DeID, with practical value for privacy-preserving medical AI applications and audits. Future work will broaden the scope beyond eye diseases.
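For readers less familiar with the two agreement metrics quoted above, the following is a minimal sketch of how a Dice score and Cohen's kappa are commonly computed. It uses the standard textbook definitions, not the authors' evaluation code; the function names `dice_coefficient` and `cohens_kappa` and the toy inputs are illustrative assumptions.

```python
import numpy as np


def dice_coefficient(pred, target):
    """Dice overlap between two binary masks: 2|A ∩ B| / (|A| + |B|)."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    total = pred.sum() + target.sum()
    if total == 0:
        # Both masks are empty; treat this as perfect agreement (a convention).
        return 1.0
    intersection = np.logical_and(pred, target).sum()
    return float(2.0 * intersection / total)


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two raters: (p_o - p_e) / (1 - p_e)."""
    a = np.asarray(labels_a)
    b = np.asarray(labels_b)
    if a.shape != b.shape:
        raise ValueError("Both raters must label the same items.")
    # Observed agreement: fraction of items given the same label.
    p_o = float(np.mean(a == b))
    # Chance agreement: sum over classes of the product of marginal rates.
    classes = np.union1d(a, b)
    p_e = float(sum(np.mean(a == c) * np.mean(b == c) for c in classes))
    if p_e == 1.0:
        # Both raters used a single identical class; kappa is undefined,
        # so return 1.0 by convention.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    # Toy data (hypothetical, not taken from MeMa).
    print(dice_coefficient([1, 1, 0, 0], [1, 0, 0, 0]))
    print(cohens_kappa([0, 0, 1, 1], [0, 0, 1, 0]))
```

Under these definitions, a kappa above 0.81 is conventionally read as near-perfect agreement between two raters, which is how the clinical-consistency figure above is usually interpreted.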
Abstract
Face de-identification (DeID) has been widely studied for common scenes, but remains under-researched for medical scenes, mostly due to the lack of large-scale patient face datasets. In this paper, we release MeMa, which consists of over 40,000 photo-realistic patient faces re-generated from a large collection of real patient photos. By carefully modulating the generation and data-filtering procedures, MeMa avoids breaching real patient privacy while ensuring rich and plausible medical manifestations. We recruit expert clinicians to annotate MeMa with both coarse- and fine-grained labels, building the first medical-scene DeID benchmark. Additionally, we propose a baseline approach for this new medical-aware DeID task that integrates data-driven medical semantic priors into the DeID procedure. Despite its simplicity, our approach substantially outperforms previous ones. The dataset is available at https://github.com/tianyuan168326/MeMa-Pytorch.
