Towards Fair and Efficient De-identification: Quantifying the Efficiency and Generalizability of De-identification Approaches

Noopur Zambare, Kiana Aghakasiri, Carissa Lin, Carrie Ye, J. Ross Mitchell, Mohamed Abdalla

TL;DR

This work systematically assesses efficiency and generalizability in clinical de-identification across model families, from BERT variants to small and large LLMs. It shows that smaller, well fine-tuned models can match large models in performance while drastically reducing inference costs, and introduces BERT-MultiCulture-DEID to improve cross-language robustness. The study highlights substantial cross-format and cross-language challenges, revealing that large LLMs are often impractical for deployment despite their robustness, and demonstrates a practical path forward through targeted fine-tuning for multi-cultural identifiers. The findings offer actionable guidance for fair, efficient clinical de-identification in real-world settings and provide publicly releasable models and benchmarks to advance the field.

Abstract

Large language models (LLMs) have shown strong performance on clinical de-identification, the task of identifying sensitive identifiers to protect privacy. However, previous work has not examined their generalizability between formats, cultures, and genders. In this work, we systematically evaluate fine-tuned transformer models (BERT, ClinicalBERT, ModernBERT), small LLMs (Llama 1-8B, Qwen 1.5-7B), and large LLMs (Llama-70B, Qwen-72B) at de-identification. We show that smaller models achieve comparable performance while substantially reducing inference cost, making them more practical for deployment. Moreover, we demonstrate that smaller models can be fine-tuned with limited data to outperform larger models in de-identifying identifiers drawn from Mandarin, Hindi, Spanish, French, Bengali, and regional variations of English, in addition to gendered names. To improve robustness in multi-cultural contexts, we introduce and publicly release BERT-MultiCulture-DEID, a set of de-identification models based on BERT, ClinicalBERT, and ModernBERT, fine-tuned on MIMIC with identifiers from multiple language variants. Our findings provide the first comprehensive quantification of the efficiency-generalizability trade-off in de-identification and establish practical pathways for fair and efficient clinical de-identification. Details on accessing the models are available at: https://doi.org/10.5281/zenodo.18342291

Paper Structure (32 sections, 9 figures, 9 tables)

This paper contains 32 sections, 9 figures, 9 tables.

Figures (9)

  • Figure 1: Inference-time computation of different de-identification models. The y-axis shows words processed per second, and the x-axis shows recall. Results are grouped by model type: BERT variants, smaller Qwen (1.5-7B) and Llama (1-8B) models, large Qwen-72B and Llama-70B models, and rule-based models.
  • Figure 2: Setup of generalization testing. (Ex. 2.1) Cross-format testing: Models were fine-tuned on MIMIC and tested on both MIMIC and PCD, and vice versa, fine-tuned on PCD and tested on both datasets. (Ex. 2.2) Multi-cultural testing: Models fine-tuned on MIMIC notes with US English identifiers were tested on notes with identifiers from different language variants. (Ex. 2.3) Performance disparity across gendered names: Notes with identifiers from different language variants were used, and recall was evaluated specifically for name identifiers.
  • Figure 3: Experiment 2.2.1: Recall of de-identification models fine-tuned on 1,000 MIMIC-III samples and tested on 500 samples from five languages. P: Precision, R: Recall. Full results are given in the appendix.
  • Figure 4: Relative difference in recall of the same model tested on US identifiers versus other languages. The models are fine-tuned on 1,000 samples with US English identifiers and evaluated on 500 samples for each language.
  • Figure 5: Distribution of PII entities in the testing data. The left y-axis shows the total count of each PII type in the test set, and the right y-axis shows the percentage of missed identifiers out of all missed identifiers during de-identification by the best-performing model BERT.
  • ...and 4 more figures
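
The quantities named in the captions above (recall, words processed per second, and the relative difference in recall between US identifiers and another language) can be illustrated with a minimal sketch. The function names are hypothetical, and the formula for the relative difference, (other - baseline) / baseline, is an assumption; the captions do not state the exact definition used in the paper. The numbers in the usage lines are made up for illustration and are not results from the paper.

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Fraction of gold identifiers that the model found."""
    total = true_positives + false_negatives
    if total == 0:
        raise ValueError("no gold identifiers to evaluate")
    return true_positives / total


def relative_recall_difference(baseline_recall: float, other_recall: float) -> float:
    """Assumed definition: change in recall relative to the US-identifier baseline.

    Negative values mean the model misses more identifiers on the other language.
    """
    if baseline_recall == 0:
        raise ValueError("baseline recall must be non-zero")
    return (other_recall - baseline_recall) / baseline_recall


def words_per_second(word_count: int, elapsed_seconds: float) -> float:
    """Throughput measure of the kind plotted on the y-axis of Figure 1."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return word_count / elapsed_seconds


# Illustrative (made-up) counts, not results from the paper.
us_recall = recall(true_positives=90, false_negatives=10)
other_recall = recall(true_positives=81, false_negatives=19)
print(f"relative difference: {relative_recall_difference(us_recall, other_recall):.3f}")
print(f"throughput: {words_per_second(1000, 4.0):.1f} words/s")
```

Expressing the gap relative to the baseline, rather than as an absolute difference, makes disparities comparable across models whose US-identifier recall differs.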