Towards Fairer Health Recommendations: finding informative unbiased samples via Word Sense Disambiguation
Gavin Butts, Pegah Emdad, Jethro Lee, Shannon Song, Chiman Salavati, Willmar Sosa Diaz, Shiri Dori-Hacohen, Fabricio Murai
TL;DR
This work tackles bias in medical curricula and health AI by shifting the focus to data quality through Word Sense Disambiguation (WSD). Building on the BRICC dataset, the authors refine the negative samples to remove sentences in which race/ethnicity terms are used in an unrelated sense, and they evaluate bias detectors using both fine-tuned transformers (e.g., RoBERTa, DistilBERT, BioBERT, TinyLlama) and large language models (LLMs) with zero-/few-shot prompting. They show that WSD substantially improves precision and F1 for bias detection, while LLMs underperform relative to strong transformer baselines, highlighting a data-centric path to fairness in healthcare NLP. The results suggest that combining WSD-filtered negatives with robust BERT-based classifiers yields practical gains for high-recall bias detection, which could support more equitable medical education content and safer deployment of health recommender systems.
Abstract
There have been growing concerns around high-stakes applications that rely on models trained with biased data, which consequently produce biased predictions, often harming the most vulnerable. In particular, biased medical data could cause health-related applications and recommender systems to create outputs that jeopardize patient care and widen disparities in health outcomes. A recent framework titled Fairness via AI posits that, instead of attempting to correct model biases, researchers must focus on their root causes by using AI to debias data. Inspired by this framework, we tackle bias detection in medical curricula using NLP models, including LLMs, and evaluate them on a gold standard dataset containing 4,105 excerpts annotated by medical experts for bias from a large corpus. We build on previous work by coauthors that augments the set of negative samples with non-annotated text containing social identifier terms. However, some of these terms, especially those related to race and ethnicity, can carry different meanings (e.g., "white matter of spinal cord"). To address this issue, we propose the use of Word Sense Disambiguation models to refine dataset quality by removing irrelevant sentences. We then evaluate fine-tuned variations of BERT models as well as GPT models with zero- and few-shot prompting. We found LLMs, considered SOTA on many NLP tasks, unsuitable for bias detection, while fine-tuned BERT models generally perform well across all evaluated metrics.
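The negative-sample refinement described above can be illustrated with a toy sketch. It is not the authors' WSD model: it swaps in a simple cue-word overlap heuristic (in the spirit of the Lesk algorithm), and the sense inventory `SENSES`, the cue words, and the helper names `disambiguate` and `filter_negatives` are all hypothetical, chosen only to show how a sentence such as "white matter of spinal cord" would be dropped from the pool of candidate negatives.

```python
import re

# Hypothetical sense inventory for one ambiguous social identifier term.
# The cue words are illustrative assumptions, not taken from the paper.
SENSES = {
    "white": {
        "race": {"patients", "population", "women", "men", "adults",
                 "children", "ethnicity", "race", "americans"},
        "anatomy_or_color": {"matter", "cord", "spinal", "blood", "cell",
                             "cells", "count", "brain", "lesion"},
    }
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def disambiguate(term, sentence):
    """Return the sense whose cue words overlap most with the sentence.

    Returns None when no cue word matches or when the top two senses tie,
    i.e. when this simple heuristic cannot decide.
    """
    tokens = set(tokenize(sentence)) - {term}
    scores = {sense: len(tokens & cues) for sense, cues in SENSES[term].items()}
    ranked = sorted(scores.values(), reverse=True)
    if ranked[0] == 0 or (len(ranked) > 1 and ranked[0] == ranked[1]):
        return None
    return max(scores, key=scores.get)


def filter_negatives(sentences, term="white", keep_sense="race"):
    """Keep only sentences where `term` is used in the `keep_sense` sense."""
    return [
        s for s in sentences
        if term in tokenize(s) and disambiguate(term, s) == keep_sense
    ]


candidates = [
    "Lesions were found in the white matter of spinal cord.",
    "White patients were more likely to receive the treatment.",
    "The white blood cell count was elevated.",
]
for sentence in filter_negatives(candidates):
    print(sentence)
```

In this sketch, undecided sentences are discarded along with the clearly irrelevant ones, which favors a cleaner negative set over a larger one. A trained WSD model would replace `disambiguate` in practice.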
