Fine-Tune, Don't Prompt, Your Language Model to Identify Biased Language in Clinical Notes

Isotta Landi; Eugenia Alleva; Nicole Bussola; Rebecca M. Cohen; Sarah Nowlin; Leslee J. Shaw; Alexander W. Charney; Kimberly B. Glazer

TL;DR

Fine-tuning outperforms prompting for classifying the emotional valence of language in clinical notes, and models must be adapted to specific medical specialties to achieve clinically appropriate performance.

Abstract

Clinical documentation can contain emotionally charged language with stigmatizing or privileging valences. We present a framework for detecting and classifying such language as stigmatizing, privileging, or neutral. We constructed a curated lexicon of biased terms scored for emotional valence. We then used lexicon-based matching to extract text chunks from OB-GYN delivery notes (Mount Sinai Hospital, NY) and MIMIC-IV discharge summaries across multiple specialties. Three clinicians annotated all chunks, enabling characterization of valence patterns across specialties and healthcare systems. We benchmarked multiple classification strategies (zero-shot prompting, in-context learning, and supervised fine-tuning) across encoder-only models (GatorTron) and generative large language models (Llama). Fine-tuning with lexically primed inputs consistently outperformed prompting approaches. GatorTron achieved an F1 score of 0.96 on the OB-GYN test set, outperforming larger generative models while requiring minimal prompt engineering and fewer computational resources. External validation on MIMIC-IV revealed limited cross-domain generalizability (F1 < 0.70, 44% drop). Training on the broader MIMIC-IV dataset improved generalizability when testing on OB-GYN (F1 = 0.71, 11% drop), but at the cost of reduced precision. Our findings demonstrate that fine-tuning outperforms prompting for emotional valence classification and that models must be adapted to specific medical specialties to achieve clinically appropriate performance. The same terms can carry different emotional valences across specialties: words with clinical meaning in one context may be stigmatizing in another. For bias detection, where misclassification risks undermining clinician trust or perpetuating patient harm, specialty-specific fine-tuning is essential to capture these semantic shifts.

* Equal contribution.
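The lexicon-based chunk extraction step mentioned in the abstract can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the three lexicon terms and their scores, the `extract_chunks` helper, whitespace tokenization, and the fixed token window are all hypothetical stand-ins for whatever the authors actually used.

```python
import re

# Hypothetical mini-lexicon (NOT the paper's lexicon): term -> illustrative score,
# where negative values stand for stigmatizing and positive for privileging valence.
LEXICON = {"noncompliant": -1.0, "refused": -0.5, "pleasant": 0.5}


def extract_chunks(note, lexicon, window=5):
    """Return one chunk per lexicon match.

    Each chunk is the matched token plus up to `window` whitespace-separated
    tokens on each side. The window size is an assumed parameter.
    """
    tokens = note.split()
    chunks = []
    for i, tok in enumerate(tokens):
        # Strip punctuation (keeping hyphens) and lowercase before lookup.
        word = re.sub(r"[^\w-]", "", tok).lower()
        if word in lexicon:
            start = max(0, i - window)
            end = min(len(tokens), i + window + 1)
            chunks.append(
                {
                    "term": word,
                    "lexicon_score": lexicon[word],
                    "chunk": " ".join(tokens[start:end]),
                }
            )
    return chunks


# Synthetic example sentence, written for this sketch only.
note = (
    "Patient is a pleasant 34 year old woman. "
    "She refused epidural and was noncompliant with monitoring."
)
chunks = extract_chunks(note, LEXICON, window=3)
for c in chunks:
    print(c["term"], "|", c["chunk"])
```

In a pipeline like the one described, chunks of this kind would then go to annotators, since the abstract notes that a term's valence in context can differ from its lexicon score.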

Paper Structure (16 sections, 2 figures, 10 tables)

Introduction
Related Work
Methods
Lexicon Creation
Chunk extraction
OB-GYN
MIMIC-IV
Chunk Annotation
Language Models Tuning Strategies
Encoder-based Models
Generative Models
Results
Emotional Valence in Clinical Text Varies by Medical Specialty
Identification of Emotional Valence in Clinical Text via Language Models
Discussion
...and 1 more section

Figures (2)

Figure 1: Multi-stage methodology outline.
Figure 2: Mean lexicon scores for stigmatizing and privileging valence are shown alongside emotional valence annotation labels for OB-GYN (panel a) and MIMIC-IV (panel b) text chunks. Solid dots represent annotated valence labels, while vertical lines indicate the range of emotional valences observed in clinical note chunks relative to the original lexicon scores. Blue dots denote instances labeled as neutral during real-world annotation. Solid black dots represent words that were not assigned a stigmatizing or privileging score in the lexicon but were labeled as such during annotation of real-world text chunks.
