In the Name of Fairness: Assessing the Bias in Clinical Record De-identification

Yuxin Xiao, Shulammite Lim, Tom Joseph Pollard, Marzyeh Ghassemi

TL;DR

This study exposes systematic bias in clinical name de-identification across demographic groups through a large-scale, name-centric evaluation of nine methods over 16 name sets and 100 manually curated clinical templates. It reveals statistically significant gaps in recall across gender, race, name popularity, and decade of popularity, and identifies polysemy and context-gender mismatches as key contributors to underperformance. The authors demonstrate a practical mitigation: fine-tuning de-identification systems with clinical context and diverse names, which improves overall recall and reduces bias in a method-agnostic fashion. The work underscores the urgent need for audits and inclusive data practices to ensure equitable privacy protection and reproducibility in healthcare research.

Abstract

Data sharing is crucial for open science and reproducible research, but the legal sharing of clinical data requires the removal of protected health information from electronic health records. This process, known as de-identification, is often achieved through the use of machine learning algorithms by many commercial and open-source systems. While these systems have shown compelling results on average, the variation in their performance across different demographic groups has not been thoroughly examined. In this work, we investigate the bias of de-identification systems on names in clinical notes via a large-scale empirical analysis. To achieve this, we create 16 name sets that vary along four demographic dimensions: gender, race, name popularity, and the decade of popularity. We insert these names into 100 manually curated clinical templates and evaluate the performance of nine public and private de-identification methods. Our findings reveal that there are statistically significant performance gaps along a majority of the demographic dimensions in most methods. We further illustrate that de-identification quality is affected by polysemy in names, gender context, and clinical note characteristics. To mitigate the identified gaps, we propose a simple and method-agnostic solution by fine-tuning de-identification methods with clinical context and diverse names. Overall, it is imperative to address the bias in existing methods immediately so that downstream stakeholders can build high-quality systems to serve all demographic parties fairly.

Paper Structure

This paper contains 36 sections, 11 figures, and 6 tables.

Figures (11)

  • Figure 1: Workflow of our empirical study. We identify (a) four demographic dimensions and prepare (b) 16 name sets with diverse settings. For each name set, we duplicate each of the (c) 100 clinical templates ten times and populate the copies with randomly generated names. We then use these (d) 16,000 evaluation notes to assess (e) nine de-identification methods.
  • Figure 2: Recall and 95% bootstrapped confidence interval of the demographic groups along each dimension by each examined de-identification method. Disparities in performance between different groups are more obvious along the dimensions of race and popularity than along the dimensions of gender and decade.
  • Figure 3: Average recall and standard error of each name set by the examined de-identification methods, ordered by decreasing recall. The average recall on name sets with top popularity exceeds the other sets by a clear margin. Moreover, the methods are, on average, more capable of recognizing less popular names associated with the White racial group compared to more popular names associated with the Asian racial group.
  • Figure 4: Recall and 95% bootstrapped confidence interval on polysemous first names associated with three racial groups by each examined de-identification method. For most methods, the recall ranking among the three groups remains relatively consistent with the ranking under the original setting in Figure 2 (b). The increase in recall illustrated by the lighter color bar refers to the partially correct de-identification of non-polysemous last names.
  • Figure 5: Difference in recall and 95% bootstrapped confidence interval between names that are consistent and inconsistent with the genders suggested by the local context. A positive recall difference means that performance was better when there was gender consistency, while a negative recall difference means that performance was better when there was gender inconsistency. Methods leveraging the gender context for name recognition are expected to see a positive recall difference.
  • ...and 6 more figures
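
The workflow in Figure 1 and the recall intervals reported in Figures 2, 4, and 5 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `[FIRST]`/`[LAST]` placeholder syntax, the function names (`build_notes`, `name_hits`, `bootstrap_ci`), the substring-based hit test, and the percentile bootstrap are all assumptions made for the example. Only the counts (16 name sets, 100 templates, ten copies each) come from the paper.

```python
import random

def populate(template, first, last):
    # Hypothetical placeholder syntax; the paper's template format may differ.
    return template.replace("[FIRST]", first).replace("[LAST]", last)

def build_notes(templates, names, copies=10, seed=0):
    """Duplicate each template `copies` times and fill each copy with a random name."""
    rng = random.Random(seed)
    notes = []
    for template in templates:
        for _ in range(copies):
            first, last = rng.choice(names)
            notes.append((populate(template, first, last), (first, last)))
    return notes

def name_hits(notes, deidentify):
    """Return one 0/1 flag per inserted name token: 1 if the token was removed.

    Assumption: a token counts as detected when it no longer appears verbatim
    in the de-identified text. Real evaluations usually compare character spans.
    """
    hits = []
    for text, (first, last) in notes:
        redacted = deidentify(text)
        for token in (first, last):
            hits.append(0 if token in redacted else 1)
    return hits

def bootstrap_ci(hits, n_resamples=1000, alpha=0.05, seed=0):
    """Recall with a percentile bootstrap interval (95% when alpha=0.05)."""
    rng = random.Random(seed)
    n = len(hits)
    # Resample the flags with replacement and record the recall of each resample.
    means = sorted(sum(rng.choices(hits, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(hits) / n, lower, upper

# Size of the evaluation corpus described in Figure 1.
N_EVALUATION_NOTES = 16 * 100 * 10
```

To compare two demographic groups under this sketch, one would call `build_notes` once per name set with the same templates, pass each result through the same `deidentify` callable, and check whether the resulting intervals overlap.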