Can Zero-Shot Commercial APIs Deliver Regulatory-Grade Clinical Text DeIdentification?

Veysel Kocaman, Muhammed Santas, Yigit Gul, Mehmet Butgul, David Talby

TL;DR

This study benchmarks four leading de-identification solutions (John Snow Labs Healthcare NLP Library, Azure Health Data Services, AWS Comprehend Medical, and OpenAI GPT-4o) on detecting protected health information (PHI) in a ground-truth set of 48 expert-annotated clinical notes. It reports that the John Snow Labs solution delivers the highest PHI-detection accuracy, achieving regulatory-grade performance that exceeds human benchmarks, while also being the most cost-effective thanks to a fixed-cost local deployment model. Cloud-based APIs show competitive accuracy but offer limited customization and incur escalating per-request costs. The results give researchers and healthcare organizations practical guidance on selecting scalable, accurate, and economical de-identification tools, and the study introduces a publicly available benchmark dataset to support ongoing evaluation.

Abstract

We evaluate four leading solutions for de-identification of unstructured medical text - Azure Health Data Services, AWS Comprehend Medical, OpenAI GPT-4o, and John Snow Labs - on a ground-truth dataset of 48 clinical documents annotated by medical experts. The analysis, conducted at both the entity level and the token level, suggests that John Snow Labs' Medical Language Models solution achieves the highest accuracy, with a 96% F1-score in protected health information (PHI) detection, outperforming Azure (91%), AWS (83%), and GPT-4o (79%). John Snow Labs is the only solution that achieves regulatory-grade accuracy (surpassing that of human experts), and it is also the most cost-effective: it is over 80% cheaper than Azure and GPT-4o, and it is the only solution not priced by token. Its fixed-cost local deployment model avoids the escalating per-request fees of cloud-based services, making it a scalable and economical choice.
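The distinction between entity-level and token-level scoring can be illustrated with a minimal Python sketch. It is not the paper's evaluation code: the exact-match rule for entities, the whitespace tokenizer, the helper names (`span_of`, `entity_level`, `token_level`), and the toy sentence are all assumptions made for illustration.

```python
import re
from typing import NamedTuple


class Span(NamedTuple):
    start: int   # character offset, inclusive
    end: int     # character offset, exclusive
    label: str


def span_of(text: str, phrase: str, label: str) -> Span:
    """Hypothetical helper: locate the first occurrence of `phrase` in `text`."""
    start = text.index(phrase)
    return Span(start, start + len(phrase), label)


def prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall and F1 from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def entity_level(gold: list[Span], pred: list[Span]) -> tuple[float, float, float]:
    """Assumed rule: a prediction counts only if boundaries and label match exactly."""
    gold_set, pred_set = set(gold), set(pred)
    tp = len(gold_set & pred_set)
    return prf(tp, len(pred_set - gold_set), len(gold_set - pred_set))


def token_labels(text: str, spans: list[Span]) -> dict[int, str]:
    """Label each whitespace-delimited token that overlaps a span, keyed by offset."""
    labels = {}
    for match in re.finditer(r"\S+", text):
        for span in spans:
            if match.start() < span.end and span.start < match.end():
                labels[match.start()] = span.label
                break
    return labels


def token_level(text: str, gold: list[Span], pred: list[Span]) -> tuple[float, float, float]:
    """Score token by token, so a partially detected entity earns partial credit."""
    gold_tokens = token_labels(text, gold)
    pred_tokens = token_labels(text, pred)
    tp = sum(1 for offset, label in pred_tokens.items() if gold_tokens.get(offset) == label)
    fp = len(pred_tokens) - tp
    fn = sum(1 for offset, label in gold_tokens.items() if pred_tokens.get(offset) != label)
    return prf(tp, fp, fn)


# Toy example (invented, not taken from the benchmark dataset).
text = "Patient John Smith was seen on 2021-03-04 in Boston."
gold = [
    span_of(text, "John Smith", "NAME"),
    span_of(text, "2021-03-04", "DATE"),
    span_of(text, "Boston", "LOCATION"),
]
# A hypothetical system that finds only the first name and the date.
pred = [
    span_of(text, "John", "NAME"),
    span_of(text, "2021-03-04", "DATE"),
]

entity_scores = entity_level(gold, pred)
token_scores = token_level(text, gold, pred)
print("entity-level P/R/F1:", entity_scores)
print("token-level  P/R/F1:", token_scores)
```

In this toy case the partially detected name counts as both a miss and a spurious prediction under exact entity matching, but earns credit for the one token it covers under token-level scoring, which is why the two views can diverge for the same system output.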

Paper Structure

This paper contains 18 sections, 6 figures, and 6 tables.

Figures (6)

  • Figure 1: De-Identification process identifies potential pieces of content with personal information about patients and removes them by replacing them with semantic tags or fake entities.
  • Figure 2: Visualization of the F1-Scores for each label
  • Figure A1: Entity Level Evaluation
  • Figure A2: Token Level Evaluation
  • Figure A3: De-identification Results of the Tools on a Sample Text
  • ...and 1 more figure